Mentor - Mr. Rohit Raj
Members
For industries around the world, workplace accidents are a major concern: they affect the lives and well-being of employees, contractors and their families, and the industry faces losses in terms of hospital charges, litigation fees, reputation and lost employee morale. Based on these facts, we intend to build a chatbot that can highlight the safety risk for a given incident description to professionals including:
1. Personnel from the safety and compliance team
2. Senior management from the plant
3. Personnel from other plants across the globe
4. Government and industrial safety groups
5. Anyone interested in or doing research on industrial safety
6. Emergency health and safety teams
7. Fire safety and industrial hazard teams
8. General management
9. Other personnel requiring safety risk information
so that these professionals can:
1. Take preventive and proactive measures based on past history
2. React faster to employee concerns related to safety
3. Help position equipment and machinery in safe places where the risk of potential accidents is minimised
4. Gain insights about safety in industries where safety is paramount
5. Reduce insurance costs through better handling of personnel, equipment and other resources
6. Take other safety-related decisions and actions
The user should be able to input an incident description, and the chatbot should predict the potential accident or vulnerability level; this can be extended or configured for different scenarios.
The dataset describes accident incidents from twelve different plants across three different countries and consists of 425 records. It has the following columns:
Date: timestamp or time/date information
Countries: Which country the accident occurred (anonymised)
Local: The city where the manufacturing plant is located (anonymised)
Industry sector: Which sector the plant belongs to
Accident Level: From I to VI, it registers how severe the accident was (I means not severe and VI means very severe)
Potential Accident Level: From I to VI, depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)
Gender: Whether the person involved is male or female
Employee or Third Party: If the injured person is an employee or a third party / contractor
Critical Risk: Description of the risk involved in the accident
Description: Detailed description of how the accident happened
On inspection of the dataset it appears that:
1. The dataset is limited, consisting of only 425 records, so training models with high accuracy could be a challenge
2. The dataset is imbalanced on certain variables, like Potential Accident Level and Accident Level, which means we may not get consistent results unless the dataset is treated to reduce the imbalance
3. Minor accidents are more common than major accidents, which mirrors real-world situations
4. There is data from three countries
5. There are twelve locals, or cities, from which the data is taken
6. There are three industry sectors: mining, metals, and all others grouped together as "others"
7. There are five accident levels
8. There are six potential accident levels
9. There are employees, third parties and remote third parties involved in the accidents
10. There are thirty-three different types of critical risk, one of which is assigned to each accident incident
11. The accident descriptions are highly unclean, so a considerable amount of cleaning effort will be required to produce results
12. The dataset covers January 2016 to July 2017
13. Males are involved in accidents more often than females, which also mirrors real-world situations, as considerably fewer females work in industrial environments
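The class imbalance noted in point 2 can be quantified directly with a normalized value count. The counts below are made up to mimic the skew described above (425 records, minor accidents dominating); running the same two lines on the real Accident Level column gives the actual distribution:

```python
import pandas as pd

# Hypothetical label counts shaped like the skew described above,
# not the real distribution from the dataset
levels = pd.Series([1] * 300 + [2] * 40 + [3] * 30 + [4] * 30 + [5] * 25,
                   name='Accident Level')
dist = levels.value_counts(normalize=True).sort_index()
print(dist)
```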
Approach - We have agreed on designing a chatbot capability using Slack as the UI, integrating with RASA and an API that triggers the underlying NLP model that gets built.
We have agreed on intermediate goals and progressed through the process steps below.
As part of building the NLP model we have adopted the following process steps:
Data processing techniques: data cleansing, feature engineering, lemmatizing and stemming, removing stop words, and GloVe embedding.
Data visualization with charts, to see clearly how the data is spread across different dimensions using univariate, bivariate and multivariate analysis.
Model designing - As part of model designing we have designed and trained the models below:
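The GloVe embedding step mentioned above can be sketched as building an embedding matrix that maps each word index from a tokenizer to its pre-trained vector. The "GloVe lines" and the word index below are made-up illustrations, not real glove.6B vectors:

```python
import numpy as np

# Made-up vectors in GloVe's text format (word followed by its components);
# a real run would read these lines from a downloaded glove.6B file
glove_lines = [
    "drill 0.1 0.2 0.3",
    "hand 0.4 0.5 0.6",
]
embeddings_index = {}
for line in glove_lines:
    parts = line.split()
    embeddings_index[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Hypothetical word index as a tokenizer would produce on the cleaned
# descriptions; index 0 is reserved for padding
word_index = {"drill": 1, "hand": 2, "jumbo": 3}
embedding_dim = 3
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:  # out-of-vocabulary words stay all-zero
        embedding_matrix[i] = vector

print(embedding_matrix.shape)
```

This matrix would then be supplied as the weights of the embedding layer in the LSTM models.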
Random Forest, Gradient Boosting, Logistic Regression and SVM, plus neural network classifiers such as:
RNN, LSTM, bidirectional LSTM and FastText. We are fine-tuning and evaluating the best-performing model, which will be shipped behind the API triggered from the Slack user interface.
Findings - From the data analysis we could infer that:
Many body-related actions and accidents have been found, and a lot of equipment-related accidents are cited in the dataset. Poor features, with a lack of quality or inadequate data, result in class imbalance.
Since the data shows that the recorded Accident Level is often low even for critical risks, we will have to consider both Accident Level and Potential Accident Level for the model prediction.
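As a minimal sketch of the classical models listed above, the pipeline below feeds TF-IDF features into one of the named classifiers (Logistic Regression here). The sentences and labels are made up for shape only; the real pipeline would be fitted on the cleaned Description text with Potential Accident Level as the target:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up incident descriptions and potential accident levels
texts = [
    "worker slipped on wet floor minor bruise",
    "drill rod fell and trapped finger of mechanic",
    "sodium sulphide splash reached operator eye",
    "forklift hit scaffolding worker fractured leg",
    "loose bolt fell from platform near worker",
    "acid pipe burst causing severe chemical burn",
]
labels = [1, 3, 4, 4, 1, 5]

# TF-IDF features feeding a Logistic Regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["pipe burst sprayed acid on technician"]))
```

The same pipeline shape applies to the Random Forest, Gradient Boosting and SVM variants by swapping the final estimator.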
# Importing the required libraries
import plotly
print(plotly.__version__)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default='notebook'
from sklearn.impute import SimpleImputer
import spacy
# Basic packages
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns, gc
from scipy import stats; from scipy.stats import zscore, norm, randint
import matplotlib.style as style; style.use('fivethirtyeight')
# Models
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score, learning_curve
# Display settings
pd.options.display.max_rows = 400
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
random_state = 42
np.random.seed(random_state)
# importing os for setting path
import os
working_dir = 'E:\\Great Learning\\DL\\Capstone\\Data\\'
os.chdir(working_dir)
# Suppress warnings
import warnings; warnings.filterwarnings('ignore')
5.3.1
# Loading the Data from drive
df = pd.read_csv('Data Set - industrial_safety_and_health_database_with_accidents_description.csv')
print(df.shape)
df.head()
(425, 11)
|   | Unnamed: 0 | Data | Countries | Local | Industry Sector | Accident Level | Potential Accident Level | Genre | Employee or Third Party | Critical Risk | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-01-01 00:00:00 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... |
| 1 | 1 | 2016-01-02 00:00:00 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... |
| 2 | 2 | 2016-01-06 00:00:00 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... |
| 3 | 3 | 2016-01-08 00:00:00 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... |
| 4 | 4 | 2016-01-10 00:00:00 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... |
# Dropping the unwanted column
ds = df.copy()
ds.drop('Unnamed: 0', axis=1, inplace=True)
# Checking the information of the Dataset
ds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Data                      425 non-null    object
 1   Countries                 425 non-null    object
 2   Local                     425 non-null    object
 3   Industry Sector           425 non-null    object
 4   Accident Level            425 non-null    object
 5   Potential Accident Level  425 non-null    object
 6   Genre                     425 non-null    object
 7   Employee or Third Party   425 non-null    object
 8   Critical Risk             425 non-null    object
 9   Description               425 non-null    object
dtypes: object(10)
memory usage: 33.3+ KB
# Displaying the columns
ds.columns
Index(['Data', 'Countries', 'Local', 'Industry Sector', 'Accident Level',
'Potential Accident Level', 'Genre', 'Employee or Third Party',
'Critical Risk', 'Description'],
dtype='object')
# Renaming the Features of the Dataset
ds.rename(columns= {'Data':'Date', 'Countries':'Country', 'Genre':'Gender',
'Employee or Third Party':'Employee type'}, inplace =True)
ds.head()
|   | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... |
| 1 | 2016-01-02 00:00:00 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... |
| 2 | 2016-01-06 00:00:00 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... |
| 3 | 2016-01-08 00:00:00 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... |
| 4 | 2016-01-10 00:00:00 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... |
# Null value check
pd.DataFrame(ds.isnull().sum(), columns=['Missing value'])
|   | Missing value |
|---|---|
| Date | 0 |
| Country | 0 |
| Local | 0 |
| Industry Sector | 0 |
| Accident Level | 0 |
| Potential Accident Level | 0 |
| Gender | 0 |
| Employee type | 0 |
| Critical Risk | 0 |
| Description | 0 |
ds['Date'] = pd.to_datetime(ds['Date'])
# Deriving Year, Month, Day and Weekday features to analyse the accidents
ds['Year'] = ds['Date'].dt.year
ds['Month'] = ds['Date'].dt.month
ds['Day'] = ds['Date'].dt.day
ds['Weekday'] = ds['Date'].dt.day_name()
ds.head()
|   | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description | Year | Month | Day | Weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... | 2016 | 1 | 1 | Friday |
| 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... | 2016 | 1 | 2 | Saturday |
| 2 | 2016-01-06 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... | 2016 | 1 | 6 | Wednesday |
| 3 | 2016-01-08 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... | 2016 | 1 | 8 | Friday |
| 4 | 2016-01-10 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... | 2016 | 1 | 10 | Sunday |
# Defining a function to create a new feature called Quater (quarter of the year)
def month_quater_conversion(x):
    if x in [1, 2, 3]:
        quarter = 'First'
    elif x in [4, 5, 6]:
        quarter = 'Second'
    elif x in [7, 8, 9]:
        quarter = 'Third'
    else:
        quarter = 'Fourth'
    return quarter
ds['Quater'] = ds['Month'].apply(month_quater_conversion)
ds.head()
|   | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description | Year | Month | Day | Weekday | Quater |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... | 2016 | 1 | 1 | Friday | First |
| 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... | 2016 | 1 | 2 | Saturday | First |
| 2 | 2016-01-06 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... | 2016 | 1 | 6 | Wednesday | First |
| 3 | 2016-01-08 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... | 2016 | 1 | 8 | Friday | First |
| 4 | 2016-01-10 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... | 2016 | 1 | 10 | Sunday | First |
# Converting the target classes from Roman numerals to numeric
replace_value = {'I':1, 'II':2, 'III':3, 'IV':4, 'V':5}
ds['Accident Level'] = ds['Accident Level'].map(replace_value)
# Note: the rare class VI is mapped to 5, merging it with class V
replace_value = {'I':1, 'II':2, 'III':3, 'IV':4, 'V':5, 'VI':5}
ds['Potential Accident Level'] = ds['Potential Accident Level'].map(replace_value)
del replace_value
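A quick check of the mapping above on toy values, confirming that the rare class VI is merged into 5:

```python
import pandas as pd

replace_value = {'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5, 'VI': 5}
print(pd.Series(['I', 'V', 'VI']).map(replace_value).tolist())  # [1, 5, 5]
```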
# Analysing the categorical features
cats = ['Country', 'Local', 'Industry Sector', 'Accident Level',
'Potential Accident Level', 'Gender', 'Employee type', 'Critical Risk',
'Year', 'Month', 'Day', 'Weekday', 'Quater']
# Histogram of Country
fig = px.histogram(ds, x = 'Country', width=800, height=500, category_orders=dict(Country = ['Country_01','Country_02', 'Country_03']))
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Local with country
fig = px.histogram(ds, x = 'Local',color = 'Country', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Industry Sector
fig = px.histogram(ds, x = 'Industry Sector', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Accident Level
fig = px.histogram(ds, x = 'Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram Target Variable Potential Accident Level
fig = px.histogram(ds, width=800, height=500, x ='Potential Accident Level')
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Gender
fig = px.histogram(ds, x = 'Gender', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Employee type
fig = px.histogram(ds, x = 'Employee type',width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Critical Risk
fig = go.Figure(data = go.Histogram(y = ds['Critical Risk'].values))
fig.update_layout(bargap = .4)
fig.show()
# Histogram of Quater
fig = px.histogram(ds, x = 'Quater', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Sector which is most affected
fig = px.pie(ds, names='Industry Sector', template='seaborn')
fig.update_traces(rotation=90, pull=[0.2,0.03,0.1,0.03,0.1], textinfo="percent+label", showlegend=False)
fig.show()
#fig.show(renderer='colab')
# Potential Accident Level per country
fig = px.histogram(ds, x ='Country', color='Potential Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Country_01 is the most affected country, and most of the classes of Potential Accident Level belong to Country_01.
# Accident Level per country
fig = px.histogram(ds, x ='Country', color='Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Country_01 is the most affected country, and most of the classes of Accident Level belong to Country_01.
# Industry sector most effected by Potential Accident Level
fig = px.histogram(ds, x ='Industry Sector', color='Potential Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
The Mining sector is the most affected, and the most severe accidents also belong to this sector.
# Potential Accident Level in each Quater
fig = px.histogram(ds, color= 'Potential Accident Level', x='Quater', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
The first and second quarters account for more of the higher-level accidents (levels 4 and 5).
# Critical Risk vs Potential Accident Level
fig = px.histogram(ds, x ='Critical Risk', color='Potential Accident Level')
fig.update_layout(bargap = 0.2)
fig.show()
Most of the Potential Accident Level classes come from the "Others" class of Critical Risk, which accounts for 232 records.
The more severe Potential Accident Levels come from the classes Fall, Electrical installation, Vehicles, Projection, Pressed and Mobile equipment.
# Critical Risk vs Industry Sector
fig = px.histogram(ds, x ='Critical Risk', color='Industry Sector')
fig.update_layout(bargap = 0.2)
fig.show()
The Mining sector is the most affected sector, and most of the classes of Critical Risk come from this sector.
# Accident Level vs Potential Accident Level
fig = px.histogram(ds, color ='Potential Accident Level', x='Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Class 1 of Accident Level accounts for most of the accidents and spans all classes of Potential Accident Level (1 to 5).
# Employee type vs Potential Accident Level
fig = px.histogram(ds, color ='Potential Accident Level', x='Employee type', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Third Party and Employee are the most affected employee types.
# Industry sector vs Potential Accident Level, Gender and Accident Level
fig = px.bar(ds, x="Industry Sector", y="Accident Level", color="Gender", barmode="group", facet_col="Potential Accident Level")
fig.show()
Males are the most affected gender, and the Potential Accident Level 4 and 5 cases come mostly from the Mining sector.
# Local with Employee type
fig = px.histogram(ds, color = 'Employee type', x ='Local', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Local_3 is the most affected city, and the most affected employee types are Third Party and Employee.
# Local vs Industry Sector
fig = px.histogram(ds, color = 'Industry Sector', x ='Local', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Local 3 has the highest number of Mining sector accidents.
Local 5 has the highest number of Metals sector accidents.
All the Mining sector accidents happened in Locals 1, 2, 3, 4 and 7.
All the Metals sector accidents happened in Locals 5, 6, 8 and 9.
All the Others sector accidents happened in Locals 10, 11 and 12.
# Year vs Potential Accident Level
fig = px.histogram(ds, color = 'Year', x ='Potential Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Most of the accidents happened in 2016, with fewer in 2017.
# Year vs Industry Sector
fig = px.histogram(ds, color = 'Year', x ='Industry Sector', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Most of the Mining accidents happened in 2016, with fewer in 2017.
# Randomly visualizing the Description variable with some labels
num = np.random.randint(0, ds.shape[0])
description = ds.loc[num, 'Description']
industry = ds.loc[num, 'Industry Sector']
accident_severity = ds.loc[num, 'Accident Level']
potential_severity = ds.loc[num, 'Potential Accident Level']
employee_type = ds.loc[num, 'Employee type']
critical_risk = ds.loc[num, 'Critical Risk']
print(description)
print(' ')
print(industry)
print(accident_severity)
print(potential_severity)
print(employee_type)
print(critical_risk)
Mr. Jesus operator of the concrete throwing team (alpha N ° 18) was shooting shotcrete in the Cx work. 001 Nv. 1710 OB1. applying 0.5 m3, he realizes that the additive did not come out in the mix, directing to lift the cover of the passage valve (54 Cm x 53 Cm of ¼ inch of thickness approximately 15 Kg). verifying that the valve was open, release the lid and it hits to the third finger of the left hand against the base, causing the injury. Mining 3 4 Third Party Others
# Checking the max Description length before cleaning
max_description_len = max(len(i.split()) for i in ds['Description'])
print('Max description length:', max_description_len)
Max description length: 183
# Checking the min Description length before cleaning
min_description_len = min(len(i.split()) for i in ds['Description'])
print('Min description length:', min_description_len)
Min description length: 16
# Removing HTML tags
from bs4 import BeautifulSoup
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text
# Removing Accented characters
import unicodedata
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
# Remove special characters
import re
def remove_special_characters(text, remove_digits=False):
    # Using regex; digits are kept unless remove_digits is True
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
# Lemmatization
import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer, PorterStemmer
[nltk_data] Downloading package wordnet to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package wordnet is already up-to-date! [nltk_data] Downloading package punkt to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date!
# Function to clean the text
def normalize_corpus(doc, html_stripping=True, accented_char_removal=True, text_lower_case=True,
special_char_removal=True, stopword_removal=True, remove_digits=True):
#normalized_corpus = []
# normalize each document in the corpus
#for doc in corpus:
# strip HTML
if html_stripping:
doc = strip_html_tags(doc)
# remove accented characters
if accented_char_removal:
doc = remove_accented_chars(doc)
# lowercase the text
if text_lower_case:
doc = doc.lower()
# remove extra newlines
doc = re.sub(r'[\r|\n|\r\n]+', ' ',doc)
# remove special characters and\or digits
if special_char_removal:
# insert spaces between special characters to isolate them
special_char_pattern = re.compile(r'([{.(-)!}])')
doc = special_char_pattern.sub(" \\1 ", doc)
doc = remove_special_characters(doc, remove_digits=remove_digits)
# remove extra whitespace
doc = re.sub(' +', ' ', doc)
#normalized_corpus.append(doc)
return doc
# Applying the function to feature Description
ds['clean_Description'] = ds['Description'].map(lambda x: normalize_corpus(x))
print(ds['clean_Description'][0:10])
0 while removing the drill rod of the jumbo for ... 1 during the activation of a sodium sulphide pum... 2 in the substation milpo located at level when ... 3 being am approximately in the nv cx ob the per... 4 approximately at a m in circumstances that the... 5 during the unloading operation of the ustulado... 6 the collaborator reports that he was on street... 7 at approximately p m when the mechanic technic... 8 employee was sitting in the resting area at le... 9 at the moment the forklift operator went to ma... Name: clean_Description, dtype: object
# This function lemmatizes the text using WordNet with POS tags
for dependency in ("brown", "names", "wordnet", "averaged_perceptron_tagger", "universal_tagset", 'stopwords', 'punkt', 'words'):
    nltk.download(dependency)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    lemmatized_text = ''
    for word, tag in pos_tag(word_tokenize(text)):
        # Map the Penn Treebank tag to a WordNet POS tag (a, r, n, v)
        wnltag = tag[0].lower()
        wnltag = wnltag if wnltag in ['a', 'r', 'n', 'v'] else None
        if not wnltag:
            lemma = word
        else:
            lemma = lemmatizer.lemmatize(word, wnltag)
        lemmatized_text = lemmatized_text + ' ' + lemma
    return lemmatized_text.lstrip()
[nltk_data] Downloading package brown to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package brown is already up-to-date! [nltk_data] Downloading package names to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package names is already up-to-date! [nltk_data] Downloading package wordnet to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package wordnet is already up-to-date! [nltk_data] Downloading package averaged_perceptron_tagger to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package averaged_perceptron_tagger is already up-to- [nltk_data] date! [nltk_data] Downloading package universal_tagset to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package universal_tagset is already up-to-date! [nltk_data] Downloading package stopwords to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date! [nltk_data] Downloading package punkt to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date! [nltk_data] Downloading package words to [nltk_data] C:\Users\dsJohn\AppData\Roaming\nltk_data... [nltk_data] Package words is already up-to-date!
lemmatize_words(ds['clean_Description'][0])
'while remove the drill rod of the jumbo for maintenance the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal see this the mechanic support one end on the drill of the equipment to pull with both hand the bar and accelerate the removal from this at this moment the bar slide from its point of support and tighten the finger of the mechanic between the drilling bar and the beam of the jumbo'
# This function removes stop words
stop_words = stopwords.words('english')
def remove_stopwords(text):
    text_wo_stop_words = " ".join([word for word in str(text).split() if word not in stop_words])
    return text_wo_stop_words
ds['clean_Description'] = ds['clean_Description'].apply(lambda x: remove_stopwords(x))
# Removing single characters
def remove_single_char(text):
    pattern = r'\s+[a-zA-Z]\s+'
    text = re.sub(pattern, ' ', text)
    return text
ds['clean_Description'] = ds['clean_Description'].apply(lambda x: remove_single_char(x))
# Removing words of one or two characters
def two_character(text):
    pattern = r'\W*\b\w{1,2}\b'
    text = re.sub(pattern, '', text)
    return text
ds['clean_Description'] = ds['clean_Description'].apply(lambda x: two_character(x))
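A quick check of what the pattern above removes, on a made-up sentence: words of one or two characters are dropped along with the whitespace in front of them.

```python
import re

pattern = r'\W*\b\w{1,2}\b'
print(re.sub(pattern, '', 'he hit an iron bar at nv 12'))
```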
ds['claen_Description'] = ds['clean_Description'].apply(lambda x: lemmatize_words(x))
# Loading the spaCy English model (used below for POS counting and person-name detection)
import spacy
nlp = spacy.load('en_core_web_sm')
ds['claen_Description'][0]
'remove drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal see mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drill bar beam jumbo'
# Randomly visualizing the clean corpus
num = np.random.randint(0, ds.shape[0])
clean_desc = ds.loc[num, 'claen_Description']
clas = ds.loc[num, 'Potential Accident Level']
clas = ds.loc[num, 'Potential Accident Level']
print(clas)
print(' ')
print(clean_desc)
3 approx collaborator duval sampler prepare change remove bucket pulp sample plant courier slip fell ground support right hand generate lesion describe
from collections import Counter
#count = Counter(ds['clean_Description'])
count = Counter(ds['claen_Description'][0:10])
count
Counter({'remove drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal see mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drill bar beam jumbo': 1,
'activation sodium sulphide pump pip uncoupled sulfide solution design area reach maid immediately make use emergency shower direct ambulatory doctor later hospital note sulphide solution gram liter': 1,
'substation milpo locate level collaborator excavation work pick hand tool hit rock flat part beak bounce hit steel tip safety shoe metatarsal area leave foot collaborator cause injury': 1,
'approximately personnel begin task unlocking soquet bolt bhb machine penultimate bolt identify hexagonal head worn proceed cristobal auxiliary assistant climb platform exert pressure hand dado key prevent come bolt moment two collaborator rotate lever anticlockwise direction leave key bolt hit palm leave hand cause injury': 1,
'approximately circumstance mechanic anthony group leader eduardo eric fernandezinjuredthe three company impromec perform removal pulley motor pump zaf marcy length weight lock proceed heat pulley loosen come fall distance meter high hit instep right foot worker cause injury describe': 1,
'unload operation ustulado bag need unclog discharge mouth silo truck perform procedure maneuver unhook hose without total depressurisation mouth project ustulado powder collaborator cause irritation eye': 1,
'collaborator report street hold left hand volumetric balloon slip place hand ground volumetric balloon end break cause small wound leave hand': 1,
'approximately mechanic technician jose tecnomin verify transmission belt pump acid plant proceed turn pulley manually unexpectedly instant electrician supervisor miguel eka mining grabs transmission belt verify tension point finger trap': 1,
'employee sit rest area level raise bore suffered sudden illness fall suffer excoriation face': 1,
'moment forklift operator go manipulate big bag bioxide section front ladder lead area manual displacement splash spent height forehead fissure pipe subsequently spill leave eye collaborator go nearby eyewash clean immediately medical center': 1})
remove_list = ['approximately','assisting','activity','approx','area','auxiliary','circumstance',
'carrying','collaborator','employee','discharging','performing','performed','report','execution','field','geologo',
'hour','hrs','level', 'access','loading','maintenance','xxx','moment','operator','parking','phase','preparation',
'preparing','process','technician','time','transport','upon','withdrawal','worker','carried',
'circumstance','duval','fernando', 'chagua', 'bodeguero ','luciano', 'silva','messrs', 'roger','acl','jhon',
'milton','ran', 'branch','snack','reevaluation', 'bundle', 'maslucan','fragmentos','sailor','pants','scorpion','becker','wheelbarrow','thugs','marimbondo',
'roy','canario','wila','prong','auxiliar','ajax','spoon','threeway','new','withdrew','granja','nascimento','povoado',
'martinopole','vista','ematoma','transfe','psi','tires','thunderous','cue','alcohotest','laquia','laden','quirodactilo','burr',
'grille','leans','rampa','carousel','eka','miguel','frontal','tirford','ferranta','alex','pickaxe','dds','tirfor','click',
'carlos','tyrfor','treads','quinoa','sheepskin','extra','semikneeling','boss','cristobal','dado','bhb','demag','tubo',
'jetanol','winche','jackleg','kevin','facila','chiropactyl','bowl','servant','knuckles','spume','nipple','wick','embed',
'ponchos','prongs','tips','job','resane','macedonio','taut','talus','pivot','atricion','oba','ones','shortcreteados','acc','rivets',
'sump','lajes','costa','pig','lay','aeq','ton','ydrs','shake','lit','retracts','catheter','speart','zamac','ingots','watermelon','beak',
'fectuaban','roman','milpo','luiz','amg','hematoma','mangote','pablo','potions','dtn','zinco','elismar','carmen','cats','brapdd',
'lloclla','one','work','mesh','contact','key','came','locomotive','basket','master','epps','gps','bee','set','onto','mixkret','bees','order','conveyor',
'cat']
# Removing the domain-specific noise words from the cleaned descriptions
ds['claen_Description'] = ds['claen_Description'].apply(lambda x: ' '.join([words for words in x.split() if words not in remove_list]))
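A small implementation note: `remove_list` is scanned linearly for every token inside the `apply`, so converting it to a `set` gives O(1) membership tests. A minimal sketch (the words below are just a subset of the real list, for illustration):

```python
# Sketch: filter tokens against a set rather than a list for O(1) lookups.
remove_set = {'collaborator', 'employee', 'operator'}  # subset of remove_list

def strip_words(text, remove=remove_set):
    """Drop every token that appears in the removal set."""
    return ' '.join(word for word in text.split() if word not in remove)

print(strip_words('collaborator hold left hand volumetric balloon'))
# hold left hand volumetric balloon
```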
# Analysing the part-of-speech tags
def count_pos(text):
    """Print the count of each part-of-speech tag spaCy finds in the text."""
    doc = nlp(str(text))
    counts_dict = doc.count_by(spacy.attrs.IDS['POS'])
    for pos, count in counts_dict.items():
        human_readable_tag = doc.vocab[pos].text
        print(human_readable_tag, count)
count_pos(ds['claen_Description'])
NUM 11
SPACE 21
NOUN 46
PROPN 15
PUNCT 16
VERB 4
ADJ 5
INTJ 1
ADP 1
def find_persons(text):
    """Return all PERSON entities that spaCy's NER finds in the text."""
    doc2 = nlp(str(text))
    persons = [ent.text for ent in doc2.ents if ent.label_ == 'PERSON']
    return persons
find_persons(ds['clean_Description'])
['kelly towa']
# Counter over the first ten cleaned descriptions (each unique string counts once)
count = Counter(ds['claen_Description'][0:10])
count
Counter({'remove drill rod jumbo supervisor proceeds loosen support intermediate centralizer facilitate removal see mechanic support end drill equipment pull hand bar accelerate removal bar slide point support tightens finger mechanic drill bar beam jumbo': 1,
'activation sodium sulphide pump pip uncoupled sulfide solution design reach maid immediately make use emergency shower direct ambulatory doctor later hospital note sulphide solution gram liter': 1,
'substation locate excavation pick hand tool hit rock flat part bounce hit steel tip safety shoe metatarsal leave foot cause injury': 1,
'personnel begin task unlocking soquet bolt machine penultimate bolt identify hexagonal head worn proceed auxiliary assistant climb platform exert pressure hand prevent come bolt two rotate lever anticlockwise direction leave bolt hit palm leave hand cause injury': 1,
'mechanic anthony group leader eduardo eric fernandezinjuredthe three company impromec perform removal pulley motor pump zaf marcy length weight lock proceed heat pulley loosen come fall distance meter high hit instep right foot cause injury describe': 1,
'unload operation ustulado bag need unclog discharge mouth silo truck perform procedure maneuver unhook hose without total depressurisation mouth project ustulado powder cause irritation eye': 1,
'street hold left hand volumetric balloon slip place hand ground volumetric balloon end break cause small wound leave hand': 1,
'mechanic jose tecnomin verify transmission belt pump acid plant proceed turn pulley manually unexpectedly instant electrician supervisor mining grabs transmission belt verify tension point finger trap': 1,
'sit rest raise bore suffered sudden illness fall suffer excoriation face': 1,
'forklift go manipulate big bag bioxide section front ladder lead manual displacement splash spent height forehead fissure pipe subsequently spill leave eye go nearby eyewash clean immediately medical center': 1})
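Note that `Counter` over the Series treats each whole description as one key, which is why every count above is 1. To count individual words the text has to be split first; a sketch with toy rows standing in for `ds['claen_Description']`:

```python
from collections import Counter

# Toy descriptions standing in for the cleaned-description rows
docs = [
    "hand slip cause wound leave hand",
    "rock hit hand cause injury",
]
# Counting words (not whole descriptions) requires splitting first
word_counts = Counter(word for doc in docs for word in doc.split())
print(word_counts.most_common(3))
```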
ds['Description_length'] = [len(i.split()) for i in ds['claen_Description']]
# Checking the max description length after cleaning
max_description_len = max([len(i.split()) for i in ds['claen_Description']])
print('Max description length:', max_description_len)
Max description length: 85
# Checking the min description length after cleaning
min_description_len = min([len(i.split()) for i in ds['claen_Description']])
print('Min description length:', min_description_len)
Min description length: 7
# Checking the mean description length after cleaning
Mean_description_len = ds['Description_length'].mean()
print('Mean description length:', Mean_description_len)
Mean description length: 27.63529411764706
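These length statistics (max 85, min 7, mean ≈ 28) are what later justify a fixed padding length for the sequence models. A hedged sketch of deriving such summary figures with the `statistics` module, on toy token counts (the values below are illustrative, not the real column):

```python
import statistics

# Toy per-record token counts standing in for ds['Description_length']
lengths = [7, 12, 20, 27, 27, 30, 41, 55, 70, 85]

print('max:', max(lengths))               # max: 85
print('min:', min(lengths))               # min: 7
print('mean:', statistics.mean(lengths))  # mean: 37.4
# A high quantile is a common padding-length choice that avoids
# truncating most records; here the 90th percentile:
print('p90:', statistics.quantiles(lengths, n=10)[-1])
```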
# Analysing the N-Grams
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
def plot_top_ngrams_barchart(text, n=2):
    """Plot the 10 most frequent n-grams in a Series of documents."""
    def get_top_ngram(corpus, n=None):
        vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
        bag_of_words = vec.transform(corpus)
        sum_words = bag_of_words.sum(axis=0)
        words_freq = [(word, sum_words[0, idx])
                      for word, idx in vec.vocabulary_.items()]
        words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
        return words_freq[:10]
    top_n_grams = get_top_ngram(text, n)
    x, y = map(list, zip(*top_n_grams))
    fig = px.bar(x=x, y=y, width=800, height=500)
    fig.update_layout(bargap=0.2)
    fig.show()
# Plotting the top bigrams
plot_top_ngrams_barchart(ds['claen_Description'], 2)
# Plotting the top trigrams
plot_top_ngrams_barchart(ds['claen_Description'], 3)
# Plotting the top 4-grams
plot_top_ngrams_barchart(ds['claen_Description'], 4)
# Defining the input and target variables
X = ds['claen_Description']
y = ds['Potential Accident Level']
y.unique()
array([4, 3, 1, 2, 5], dtype=int64)
# Splitting the data into train and test sets
from sklearn.model_selection import train_test_split
# Vectorising the descriptions with TF-IDF over unigrams and bigrams
cvt = TfidfVectorizer(ngram_range=(1, 2), analyzer='word', min_df=5, sublinear_tf=True)
Xc = cvt.fit_transform(X).toarray()
X_train, X_test, y_train, y_test = train_test_split(Xc, y, test_size = 0.15, random_state =1, shuffle=True)
print(X_train.shape)
print(y_train.shape)
print(' ')
print(X_test.shape)
print(y_test.shape)
(361, 631)
(361,)
 
(64, 631)
(64,)
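Before interpreting the accuracies that follow, it helps to know the class balance: with five severity levels, a majority-class baseline sets the bar for any model. A sketch using `Counter` on toy labels (the real distribution of `y_train` is not printed in this chunk):

```python
from collections import Counter

# Toy labels standing in for y_train (illustrative distribution only)
y_demo = [4, 4, 4, 4, 2, 2, 2, 3, 3, 1, 5]
dist = Counter(y_demo)
total = sum(dist.values())
for label, count in sorted(dist.items()):
    print(f'level {label}: {count} ({count / total:.0%})')
```

The share of the most frequent level is the accuracy a constant predictor would already achieve.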
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
rfcl = RandomForestClassifier(random_state=1)
rfcl.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
ytest_pred = rfcl.predict(X_test)
acc_rfc = accuracy_score(y_test, ytest_pred)
acc_rfc_tr = rfcl.score(X_train,y_train)
print("Train Accuracy of the Random Forest model : {:.2f}".format(acc_rfc_tr*100))
print("Test Accuracy of the Random Forest model : {:.2f}".format(acc_rfc*100))
Train Accuracy of the Random Forest model : 99.72
Test Accuracy of the Random Forest model : 40.62
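A 99.72% train score against 40.62% on test is a strong overfitting signal. One common response (not part of the original notebook) is to constrain the forest and tune it with cross-validation; a sketch on synthetic data, where the parameter grid values are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF matrix, just to make the sketch runnable
Xd, yd = make_classification(n_samples=200, n_features=50, n_informative=10,
                             random_state=1)

# Limiting depth and leaf size regularises the trees; CV picks the best combo
grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {'max_depth': [5, 10, None], 'min_samples_leaf': [1, 3, 5]},
    cv=3,
)
grid.fit(Xd, yd)
print(grid.best_params_, grid.best_score_)
```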
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100)
gbc.fit(X_train, y_train)
ytest_pred_gbc = gbc.predict(X_test)
acc_gbc = accuracy_score(y_test, ytest_pred_gbc)
acc_gbc_tr = gbc.score(X_train,y_train)
print(" Test accuracy of the Gradient boosting model : {:.2f}".format(acc_gbc*100))
print("Train accuracy of the Gradient boosting model : {:.2f}".format(acc_gbc_tr*100))
 Test accuracy of the Gradient boosting model : 43.75
Train accuracy of the Gradient boosting model : 99.72
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(class_weight='balanced', max_iter=1000)
lr.fit(X_train, y_train)
ytest_pred = lr.predict(X_test)
acc_lr = accuracy_score(y_test, ytest_pred)
acc_lr_tr = lr.score(X_train, y_train)
print(" Test accuracy of the LR model : {:.2f}".format(acc_lr*100))
print("Train accuracy of the LR model : {:.2f}".format(acc_lr_tr*100))
 Test accuracy of the LR model : 39.06
Train accuracy of the LR model : 99.72
from sklearn.svm import LinearSVC
svc = LinearSVC(max_iter=5000)
svc.fit(X_train, y_train)
ytest_pred = svc.predict(X_test)
# Evaluation
acc_svc = accuracy_score(y_test, ytest_pred)
acc_svc_tr = svc.score(X_train, y_train)
print("Train accuracy of the SVC model : {:.2f}".format(acc_svc_tr*100))
print("Test accuracy of the SVC model : {:.2f}".format(acc_svc*100))
Train accuracy of the SVC model : 99.72
Test accuracy of the SVC model : 43.75
# Printing the performance metrics
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
def print_confusion_matrix(y_test, ytest_predict):
    """Plot the confusion matrix as an annotated heatmap."""
    cm = confusion_matrix(y_test, ytest_predict)
    cm = pd.DataFrame(cm)
    plt.figure(figsize=(4, 4))
    sns.set()
    sns.heatmap(cm.T, square=True, fmt='', annot=True, cbar=False,
                xticklabels=['1', '2', '3', '4', '5'],
                yticklabels=['1', '2', '3', '4', '5']).set_title('Confusion Matrix')
    plt.xlabel('True label')
    plt.ylabel('Predicted label')
    plt.show()
print_confusion_matrix(y_test, ytest_pred_gbc)
print(classification_report(y_test, ytest_pred_gbc))
precision recall f1-score support
1 0.33 0.25 0.29 8
2 0.50 0.35 0.41 20
3 0.36 0.42 0.38 12
4 0.52 0.61 0.56 23
5 0.00 0.00 0.00 1
accuracy 0.44 64
macro avg 0.34 0.33 0.33 64
weighted avg 0.45 0.44 0.44 64
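With only 64 test records (and a single class-5 example), per-class figures like the zeros above are highly sensitive to the split. Stratified cross-validation gives a steadier estimate; a sketch on synthetic data, with the estimator choice here being illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 5-class data standing in for the TF-IDF features
Xd, yd = make_classification(n_samples=300, n_features=40, n_informative=10,
                             n_classes=5, random_state=1)
# Stratification keeps each fold's class mix close to the overall mix
scores = cross_val_score(LogisticRegression(max_iter=1000), Xd, yd,
                         cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1))
print(scores.mean().round(3), '+/-', scores.std().round(3))
```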
import tensorflow
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, Dropout, Flatten, GlobalAveragePooling1D, BatchNormalization, LSTM, GlobalMaxPooling1D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.backend import clear_session
from tensorflow.keras.initializers import Constant
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import TimeDistributed
# Reusing the cleaned descriptions as input for the deep learning models
X = ds['claen_Description']
# Converting the target to one hot for keras model
y = pd.get_dummies(ds['Potential Accident Level']).values
y[0]
array([0, 0, 0, 1, 0], dtype=uint8)
# Splitting the data for the neural network model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state =1, shuffle=True)
print(X_train.shape)
print(y_train.shape)
print(' ')
print(X_test.shape)
print(y_test.shape)
(382,)
(382, 5)
 
(43,)
(43, 5)
# Defining the parameters and indices for words
max_features = 10000 # Initiating with 10000 features
tokenizer = Tokenizer(num_words= max_features)
# Fitting the tokenizer on Training input feature
tokenizer.fit_on_texts(X_train.tolist())
print(tokenizer.word_index) # Word-to-index mapping
{'cause': 1, 'hand': 2, 'leave': 3, 'right': 4, 'injury': 5, 'use': 6, 'hit': 7, 'equipment': 8, 'assistant': 9, 'pipe': 10, 'fall': 11, 'perform': 12, 'accident': 13, 'finger': 14, 'make': 15, 'support': 16, 'floor': 17, 'move': 18, 'cut': 19, 'rock': 20, 'remove': 21, 'safety': 22, 'place': 23, 'meter': 24, 'glove': 25, 'part': 26, 'team': 27, 'position': 28, 'truck': 29, 'side': 30, 'face': 31, 'height': 32, 'pump': 33, 'drill': 34, 'impact': 35, 'carry': 36, 'metal': 37, 'release': 38, 'generate': 39, 'towards': 40, 'platform': 41, 'medical': 42, 'two': 43, 'come': 44, 'mechanic': 45, 'slip': 46, 'end': 47, 'point': 48, 'take': 49, 'describe': 50, 'return': 51, 'press': 52, 'fragment': 53, 'project': 54, 'hold': 55, 'try': 56, 'bolt': 57, 'foot': 58, 'block': 59, 'inside': 60, 'back': 61, 'front': 62, 'reach': 63, 'company': 64, 'lift': 65, 'clean': 66, 'load': 67, 'step': 68, 'plate': 69, 'vehicle': 70, 'stop': 71, 'water': 72, 'arm': 73, 'wear': 74, 'head': 75, 'hopper': 76, 'open': 77, 'verify': 78, 'bite': 79, 'tube': 80, 'structure': 81, 'center': 82, 'slide': 83, 'due': 84, 'piece': 85, 'hose': 86, 'gable': 87, 'small': 88, 'upper': 89, 'enter': 90, 'immediately': 91, 'edge': 92, 'weight': 93, 'first': 94, 'go': 95, 'ladder': 96, 'material': 97, 'turn': 98, 'helmet': 99, 'leg': 100, 'sting': 101, 'line': 102, 'pull': 103, 'injure': 104, 'ground': 105, 'car': 106, 'diameter': 107, 'injured': 108, 'bar': 109, 'belt': 110, 'event': 111, 'proceed': 112, 'pass': 113, 'lower': 114, 'movement': 115, 'produce': 116, 'rod': 117, 'get': 118, 'second': 119, 'neck': 120, 'loader': 121, 'attack': 122, 'allergic': 123, 'air': 124, 'left': 125, 'another': 126, 'eye': 127, 'suffer': 128, 'reaction': 129, 'strike': 130, 'continue': 131, 'pressure': 132, 'wound': 133, 'acid': 134, 'locate': 135, 'notice': 136, 'person': 137, 'driver': 138, 'find': 139, 'burn': 140, 'sheet': 141, 'blow': 142, 'cable': 143, 'hole': 144, 'transfer': 145, 'near': 146, 'hydraulic': 147, 
'control': 148, 'without': 149, 'base': 150, 'feel': 151, 'top': 152, 'decide': 153, 'complete': 154, 'glass': 155, 'close': 156, 'pain': 157, 'inspection': 158, 'ore': 159, 'guard': 160, 'personnel': 161, 'scissor': 162, 'last': 163, 'drilling': 164, 'machine': 165, 'workshop': 166, 'start': 167, 'prepare': 168, 'tank': 169, 'region': 170, 'away': 171, 'gate': 172, 'soil': 173, 'suddenly': 174, 'behind': 175, 'fell': 176, 'lip': 177, 'frame': 178, 'forearm': 179, 'ring': 180, 'knee': 181, 'site': 182, 'change': 183, 'collection': 184, 'park': 185, 'boot': 186, 'box': 187, 'slight': 188, 'test': 189, 'help': 190, 'unit': 191, 'grate': 192, 'cleaning': 193, 'break': 194, 'steel': 195, 'mine': 196, 'give': 197, 'plant': 198, 'welder': 199, 'section': 200, 'chain': 201, 'pulley': 202, 'partner': 203, 'scoop': 204, 'roof': 205, 'surface': 206, 'travel': 207, 'climb': 208, 'projection': 209, 'cover': 210, 'direction': 211, 'cylinder': 212, 'flange': 213, 'solution': 214, 'maneuver': 215, 'lock': 216, 'lesion': 217, 'follow': 218, 'balance': 219, 'blade': 220, 'drainage': 221, 'road': 222, 'instant': 223, 'crown': 224, 'auxiliary': 225, 'handle': 226, 'check': 227, 'tower': 228, 'cabin': 229, 'wall': 230, 'room': 231, 'service': 232, 'tool': 233, 'sample': 234, 'cathode': 235, 'hammer': 236, 'third': 237, 'described': 238, 'ramp': 239, 'shoulder': 240, 'causing': 241, 'board': 242, 'sleeve': 243, 'bag': 244, 'scaffold': 245, 'detach': 246, 'valve': 247, 'evaluate': 248, 'hospital': 249, 'realize': 250, 'removal': 251, 'thumb': 252, 'zinc': 253, 'refer': 254, 'door': 255, 'lose': 256, 'hook': 257, 'presence': 258, 'chute': 259, 'flow': 260, 'mud': 261, 'superficial': 262, 'supervisor': 263, 'evacuate': 264, 'push': 265, 'fill': 266, 'staff': 267, 'throw': 268, 'concrete': 269, 'lever': 270, 'iron': 271, 'force': 272, 'wooden': 273, 'system': 274, 'jumbo': 275, 'mining': 276, 'type': 277, 'shotcrete': 278, 'wash': 279, 'around': 280, 'tire': 281, 'easel': 282, 'forest': 
283, 'swell': 284, 'see': 285, 'drop': 286, 'observe': 287, 'untimely': 288, 'raise': 289, 'rim': 290, 'hot': 291, 'rubber': 292, 'would': 293, 'split': 294, 'chin': 295, 'ventilation': 296, 'occur': 297, 'victim': 298, 'remain': 299, 'proceeds': 300, 'beam': 301, 'mechanical': 302, 'suction': 303, 'receive': 304, 'operation': 305, 'body': 306, 'roll': 307, 'lens': 308, 'avoid': 309, 'liquid': 310, 'twist': 311, 'rope': 312, 'paralyze': 313, 'normally': 314, 'ear': 315, 'chimney': 316, 'container': 317, 'walk': 318, 'geological': 319, 'normal': 320, 'quickly': 321, 'pit': 322, 'bolter': 323, 'mill': 324, 'suffers': 325, 'forehead': 326, 'evaluation': 327, 'splash': 328, 'later': 329, 'discomfort': 330, 'tie': 331, 'city': 332, 'indicate': 333, 'discharge': 334, 'inner': 335, 'manually': 336, 'task': 337, 'pick': 338, 'surprise': 339, 'day': 340, 'chamber': 341, 'central': 342, 'rest': 343, 'high': 344, 'electric': 345, 'south': 346, 'identify': 347, 'little': 348, 'minor': 349, 'post': 350, 'clerk': 351, 'length': 352, 'attention': 353, 'table': 354, 'local': 355, 'broken': 356, 'pulp': 357, 'toward': 358, 'anchor': 359, 'clamp': 360, 'tip': 361, 'distance': 362, 'ankle': 363, 'grating': 364, 'rail': 365, 'crane': 366, 'light': 367, 'manipulate': 368, 'protection': 369, 'rotation': 370, 'leather': 371, 'product': 372, 'ampoloader': 373, 'approximate': 374, 'installation': 375, 'evacuation': 376, 'sound': 377, 'cloth': 378, 'arrange': 379, 'put': 380, 'strap': 381, 'prevent': 382, 'attempt': 383, 'heat': 384, 'oil': 385, 'performs': 386, 'activate': 387, 'palm': 388, 'positive': 389, 'engine': 390, 'emergency': 391, 'pin': 392, 'boiler': 393, 'gutter': 394, 'aid': 395, 'internal': 396, 'index': 397, 'electrician': 398, 'loosen': 399, 'probe': 400, 'care': 401, 'need': 402, 'bucket': 403, 'wire': 404, 'bruise': 405, 'bend': 406, 'inch': 407, 'tipper': 408, 'middle': 409, 'rub': 410, 'loose': 411, 'begin': 412, 'unload': 413, 'rafael': 414, 'danillo': 415, 'filter': 
416, 'thermal': 417, 'manual': 418, 'trap': 419, 'zone': 420, 'target': 421, 'manoel': 422, 'noise': 423, 'well': 424, 'ingot': 425, 'fifth': 426, 'retire': 427, 'instal': 428, 'finish': 429, 'descend': 430, 'accompany': 431, 'fourth': 432, 'long': 433, 'clinic': 434, 'convoy': 435, 'operate': 436, 'fire': 437, 'hdpe': 438, 'displacement': 439, 'metallic': 440, 'shape': 441, 'entrance': 442, 'protrude': 443, 'sit': 444, 'lamp': 445, 'old': 446, 'previously': 447, 'supervise': 448, 'electrical': 449, 'protector': 450, 'ustulation': 451, 'next': 452, 'felt': 453, 'ppe': 454, 'irritation': 455, 'chisel': 456, 'empty': 457, 'free': 458, 'enters': 459, 'generates': 460, 'bank': 461, 'approach': 462, 'shift': 463, 'struck': 464, 'rupture': 465, 'could': 466, 'imprisons': 467, 'degree': 468, 'imprison': 469, 'wrist': 470, 'lunch': 471, 'spill': 472, 'nut': 473, 'fracture': 474, 'driller': 475, 'intersection': 476, 'result': 477, 'direct': 478, 'together': 479, 'allergy': 480, 'rear': 481, 'extension': 482, 'contusion': 483, 'bump': 484, 'helper': 485, 'doctor': 486, 'form': 487, 'deep': 488, 'full': 489, 'cheekbone': 490, 'chest': 491, 'uniform': 492, 'main': 493, 'house': 494, 'suspend': 495, 'affect': 496, 'people': 497, 'protective': 498, 'winch': 499, 'stick': 500, 'overflow': 501, 'transmission': 502, 'strip': 503, 'assembly': 504, 'geologist': 505, 'saw': 506, 'oven': 507, 'hoist': 508, 'accommodate': 509, 'catch': 510, 'bitten': 511, 'way': 512, 'seat': 513, 'reducer': 514, 'wrench': 515, 'housing': 516, 'anode': 517, 'correct': 518, 'sulfuric': 519, 'initiate': 520, 'stand': 521, 'radius': 522, 'align': 523, 'mouth': 524, 'say': 525, 'bottom': 526, 'note': 527, 'forward': 528, 'involve': 529, 'pasco': 530, 'fence': 531, 'loud': 532, 'apparently': 533, 'correspond': 534, 'cart': 535, 'trip': 536, 'general': 537, 'vegetation': 538, 'shock': 539, 'wheel': 540, 'track': 541, 'call': 542, 'shank': 543, 'license': 544, 'sole': 545, 'perforation': 546, 'window': 547, 
'advance': 548, 'stumble': 549, 'adjustment': 550, 'power': 551, 'connection': 552, 'gun': 553, 'pot': 554, 'pvc': 555, 'store': 556, 'bounce': 557, 'initial': 558, 'cone': 559, 'paint': 560, 'fine': 561, 'distal': 562, 'present': 563, 'william': 564, 'cross': 565, 'mineral': 566, 'maid': 567, 'limb': 568, 'impromec': 569, 'routine': 570, 'waste': 571, 'assist': 572, 'soon': 573, 'mapping': 574, 'radio': 575, 'attend': 576, 'still': 577, 'xxcm': 578, 'cocada': 579, 'jose': 580, 'fan': 581, 'assemble': 582, 'explosive': 583, 'sustain': 584, 'obstruct': 585, 'chuteo': 586, 'weigh': 587, 'squat': 588, 'cap': 589, 'communicate': 590, 'rebound': 591, 'compose': 592, 'divine': 593, 'accumulation': 594, 'fit': 595, 'nilton': 596, 'treat': 597, 'recovery': 598, 'anfoloader': 599, 'forklift': 600, 'jhonatan': 601, 'dust': 602, 'holder': 603, 'launch': 604, 'jump': 605, 'necessary': 606, 'coil': 607, 'warehouse': 608, 'lane': 609, 'coordinate': 610, 'incident': 611, 'scooptram': 612, 'space': 613, 'arrive': 614, 'alone': 615, 'bomb': 616, 'leak': 617, 'official': 618, 'stage': 619, 'hdp': 620, 'brake': 621, 'drawer': 622, 'ask': 623, 'particle': 624, 'slope': 625, 'weld': 626, 'cmxcmxcm': 627, 'four': 628, 'tension': 629, 'alpha': 630, 'decides': 631, 'grid': 632, 'collect': 633, 'run': 634, 'maribondos': 635, 'tell': 636, 'marco': 637, 'negative': 638, 'polyethylene': 639, 'compress': 640, 'cell': 641, 'short': 642, 'sling': 643, 'external': 644, 'slightly': 645, 'secure': 646, 'impacted': 647, 'station': 648, 'happen': 649, 'natclar': 650, 'orlando': 651, 'boltec': 652, 'want': 653, 'nail': 654, 'furnace': 655, 'survey': 656, 'filling': 657, 'outside': 658, 'storm': 659, 'mean': 660, 'electrowelded': 661, 'square': 662, 'inspect': 663, 'along': 664, 'plug': 665, 'paulo': 666, 'encounter': 667, 'lean': 668, 'residual': 669, 'contaminate': 670, 'bearing': 671, 'bear': 672, 'thigh': 673, 'gallery': 674, 'supervision': 675, 'unexpectedly': 676, 'radial': 677, 'contain': 678, 
'several': 679, 'attach': 680, 'look': 681, 'office': 682, 'shoe': 683, 'pierce': 684, 'stump': 685, 'wood': 686, 'since': 687, 'graze': 688, 'action': 689, 'van': 690, 'intermediate': 691, 'knife': 692, 'arc': 693, 'amount': 694, 'accessory': 695, 'couple': 696, 'rice': 697, 'cook': 698, 'tilt': 699, 'backwards': 700, 'design': 701, 'effect': 702, 'leakage': 703, 'abruptly': 704, 'fuel': 705, 'increase': 706, 'chuck': 707, 'exert': 708, 'flash': 709, 'scaller': 710, 'paracatu': 711, 'secondary': 712, 'opening': 713, 'additive': 714, 'lid': 715, 'impacting': 716, 'stung': 717, 'wasp': 718, 'require': 719, 'cesar': 720, 'copilot': 721, 'hat': 722, 'already': 723, 'used': 724, 'industrial': 725, 'guillotine': 726, 'mount': 727, 'vertical': 728, 'residue': 729, 'previous': 730, 'telescopic': 731, 'stilson': 732, 'stone': 733, 'substation': 734, 'three': 735, 'appear': 736, 'sta': 737, 'rush': 738, 'sketch': 739, 'reconnaissance': 740, 'felipe': 741, 'jacket': 742, 'staircase': 743, 'steam': 744, 'incimmet': 745, 'cement': 746, 'dismantle': 747, 'exchange': 748, 'heavy': 749, 'fact': 750, 'strut': 751, 'warn': 752, 'respective': 753, 'entry': 754, 'treatment': 755, 'false': 756, 'pole': 757, 'across': 758, 'mask': 759, 'member': 760, 'fabio': 761, 'robson': 762, 'screen': 763, 'reduce': 764, 'aluminum': 765, 'sink': 766, 'exit': 767, 'disk': 768, 'originate': 769, 'autoclave': 770, 'quirodactyl': 771, 'feeder': 772, 'unlocking': 773, 'pound': 774, 'brace': 775, 'condition': 776, 'excoriation': 777, 'cast': 778, 'litorina': 779, 'engineer': 780, 'blunt': 781, 'stretch': 782, 'atlas': 783, 'hooked': 784, 'thickener': 785, 'motor': 786, 'drain': 787, 'horse': 788, 'cervical': 789, 'leaf': 790, 'canvas': 791, 'rung': 792, 'stool': 793, 'albino': 794, 'communicates': 795, 'angle': 796, 'current': 797, 'sediment': 798, 'cutter': 799, 'state': 800, 'highway': 801, 'aripuana': 802, 'hurry': 803, 'girdle': 804, 'goggles': 805, 'prospector': 806, 'channel': 807, 'ripper': 808, 
'safe': 809, 'metatarsal': 810, 'colleague': 811, 'tail': 812, 'nose': 813, 'fix': 814, 'twice': 815, 'incimet': 816, 'pink': 817, 'teacher': 818, 'rise': 819, 'cyclone': 820, 'spike': 821, 'geho': 822, 'shaft': 823, 'union': 824, 'production': 825, 'row': 826, 'nylon': 827, 'street': 828, 'volumetric': 829, 'balloon': 830, 'pneumatic': 831, 'dry': 832, 'module': 833, 'kelly': 834, 'rlc': 835, 'engage': 836, 'adjust': 837, 'tether': 838, 'lifeline': 839, 'duty': 840, 'pas': 841, 'sardinel': 842, 'trench': 843, 'ignite': 844, 'battery': 845, 'list': 846, 'crash': 847, 'disabled': 848, 'deenergized': 849, 'disassemble': 850, 'oxyfuel': 851, 'proingcom': 852, 'foreman': 853, 'refuge': 854, 'orange': 855, 'alert': 856, 'detector': 857, 'hear': 858, 'provoke': 859, 'install': 860, 'grab': 861, 'mceisa': 862, 'respond': 863, 'culminate': 864, 'slop': 865, 'elbow': 866, 'detachment': 867, 'rotor': 868, 'superficially': 869, 'thread': 870, 'percussion': 871, 'corrugate': 872, 'ahk': 873, 'empresa': 874, 'serve': 875, 'cma': 876, 'operational': 877, 'excavate': 878, 'occupant': 879, 'incline': 880, 'waelz': 881, 'oxide': 882, 'rpa': 883, 'cro': 884, 'lie': 885, 'immediate': 886, 'possibly': 887, 'epp': 888, 'inertia': 889, 'puddle': 890, 'dumper': 891, 'steer': 892, 'mechanized': 893, 'tajo': 894, 'heel': 895, 'joint': 896, 'sodium': 897, 'socket': 898, 'sure': 899, 'nitric': 900, 'bap': 901, 'emerson': 902, 'include': 903, 'content': 904, 'pour': 905, 'tear': 906, 'peristaltic': 907, 'reserve': 908, 'started': 909, 'skin': 910, 'ppes': 911, 'stem': 912, 'seal': 913, 'boom': 914, 'plastic': 915, 'large': 916, 'bring': 917, 'imprisonment': 918, 'introduce': 919, 'shear': 920, 'bine': 921, 'piston': 922, 'phalanx': 923, 'pad': 924, 'energy': 925, 'energize': 926, 'panel': 927, 'shell': 928, 'sanitation': 929, 'underground': 930, 'rescue': 931, 'brigade': 932, 'stretcher': 933, 'vitaulic': 934, 'intention': 935, 'apply': 936, 'mix': 937, 'thickness': 938, 'bottle': 939, 
'label': 940, 'expel': 941, 'low': 942, 'transit': 943, 'grs': 944, 'pedro': 945, 'insect': 946, 'develop': 947, 'collar': 948, 'responsible': 949, 'dining': 950, 'drive': 951, 'loses': 952, 'solid': 953, 'magnetometric': 954, 'gilvanio': 955, 'antiallergic': 956, 'marcelo': 957, 'cristian': 958, 'administrative': 959, 'bore': 960, 'lead': 961, 'escape': 962, 'partially': 963, 'polypropylene': 964, 'starter': 965, 'marcio': 966, 'sergio': 967, 'clearing': 968, 'spool': 969, 'involuntarily': 970, 'finally': 971, 'friction': 972, 'positioning': 973, 'antonio': 974, 'blast': 975, 'fixing': 976, 'management': 977, 'toe': 978, 'chicoteo': 979, 'aggregate': 980, 'shocrete': 981, 'designate': 982, 'hears': 983, 'opened': 984, 'unclog': 985, 'chemical': 986, 'caustic': 987, 'soda': 988, 'directly': 989, 'alimak': 990, 'perceives': 991, 'east': 992, 'farm': 993, 'lazaro': 994, 'divino': 995, 'morais': 996, 'ciliary': 997, 'outcrop': 998, 'machete': 999, 'snake': 1000, 'pilot': 1001, 'lhd': 1002, 'lance': 1003, 'trauma': 1004, 'sao': 1005, 'tabolas': 1006, 'conduct': 1007, 'shotcreterepentinamente': 1008, 'superior': 1009, 'injures': 1010, 'standing': 1011, 'rops': 1012, 'fops': 1013, 'polyontusions': 1014, 'scoria': 1015, 'big': 1016, 'nearby': 1017, 'eyewash': 1018, 'respirator': 1019, 'diamond': 1020, 'lateral': 1021, 'guide': 1022, 'lenses': 1023, 'food': 1024, 'unicon': 1025, 'reverse': 1026, 'supply': 1027, 'beehive': 1028, 'excite': 1029, 'legging': 1030, 'penultimate': 1031, 'derails': 1032, 'bridge': 1033, 'maperu': 1034, 'consultant': 1035, 'invade': 1036, 'civilian': 1037, 'sharply': 1038, 'melt': 1039, 'accord': 1040, 'inthinc': 1041, 'width': 1042, 'gts': 1043, 'instep': 1044, 'building': 1045, 'automatic': 1046, 'collide': 1047, 'handrail': 1048, 'effort': 1049, 'stair': 1050, 'wide': 1051, 'september': 1052, 'ahead': 1053, 'prick': 1054, 'construction': 1055, 'mason': 1056, 'sand': 1057, 'din': 1058, 'corner': 1059, 'clear': 1060, 'tread': 1061, 'tecnomin': 
1062, 'bodeguero': 1063, 'via': 1064, 'iii': 1065, 'renato': 1066, 'procedure': 1067, 'chicken': 1068, 'strong': 1069, 'leaching': 1070, 'sludge': 1071, 'failure': 1072, 'cep': 1073, 'electrolysis': 1074, 'calf': 1075, 'eyelid': 1076, 'analysis': 1077, 'curl': 1078, 'abrupt': 1079, 'suture': 1080, 'slid': 1081, 'downward': 1082, 'unlock': 1083, 'repair': 1084, 'splinter': 1085, 'chirodactilo': 1086, 'sudden': 1087, 'signal': 1088, 'sampler': 1089, 'depth': 1090, 'pen': 1091, 'interior': 1092, 'luis': 1093, 'mobile': 1094, 'barretilla': 1095, 'simba': 1096, 'bit': 1097, 'withdraw': 1098, 'skimmer': 1099, 'obb': 1100, 'danon': 1101, 'imprisoned': 1102, 'impregnate': 1103, 'welding': 1104, 'flex': 1105, 'lubricator': 1106, 'hattype': 1107, 'lubricant': 1108, 'affected': 1109, 'propeller': 1110, 'horizontally': 1111, 'stack': 1112, 'seek': 1113, 'medicate': 1114, 'confined': 1115, 'shower': 1116, 'participate': 1117, 'pocket': 1118, 'nro': 1119, 'borehole': 1120, 'cruise': 1121, 'cab': 1122, 'screwdriver': 1123, 'restart': 1124, 'reason': 1125, 'clothes': 1126, 'cabinet': 1127, 'stp': 1128, 'coworker': 1129, 'asks': 1130, 'bracket': 1131, 'debark': 1132, 'anfo': 1133, 'shovel': 1134, 'marcos': 1135, 'stoop': 1136, 'deviate': 1137, 'whistle': 1138, 'lyner': 1139, 'rotate': 1140, 'sledgehammer': 1141, 'tunnel': 1142, 'final': 1143, 'mild': 1144, 'samuel': 1145, 'concentrate': 1146, 'gear': 1147, 'abratech': 1148, 'putty': 1149, 'hyt': 1150, 'tick': 1151, 'drag': 1152, 'diagonal': 1153, 'rubs': 1154, 'crouch': 1155, 'marked': 1156, 'tape': 1157, 'lookout': 1158, 'ademir': 1159, 'mario': 1160, 'five': 1161, 'excavation': 1162, 'barel': 1163, 'resident': 1164, 'rollover': 1165, 'dump': 1166, 'transversely': 1167, 'vazante': 1168, 'mata': 1169, 'serra': 1170, 'garrote': 1171, 'wca': 1172, 'leandro': 1173, 'jehovanio': 1174, 'shallow': 1175, 'carton': 1176, 'possible': 1177, 'breno': 1178, 'consequently': 1179, 'belly': 1180, 'jehovah': 1181, 'screw': 1182, 'observed': 1183, 
'problem': 1184, 'diesel': 1185, 'accidentally': 1186, 'carbon': 1187, 'ustulador': 1188, 'excess': 1189, 'distributor': 1190, 'camera': 1191, 'contracture': 1192, 'sulphide': 1193, 'magazine': 1194, 'feed': 1195, 'storage': 1196, 'explode': 1197, 'hr': 1198, 'pom': 1199, 'tray': 1200, 'torch': 1201, 'foam': 1202, 'pipette': 1203, 'eliseo': 1204, 'hoe': 1205, 'fixed': 1206, 'abdomen': 1207, 'jaw': 1208, 'wedge': 1209, 'crusher': 1210, 'device': 1211, 'overhead': 1212, 'ronald': 1213, 'lighthouse': 1214, 'symptom': 1215, 'bypass': 1216, 'raul': 1217, 'rolando': 1218, 'helical': 1219, 'overhang': 1220, 'iscmg': 1221, 'decrease': 1222, 'isidro': 1223, 'torres': 1224, 'standardization': 1225, 'mixed': 1226, 'attribute': 1227, 'clog': 1228, 'eject': 1229, 'detritus': 1230, 'circuit': 1231, 'tito': 1232, 'anything': 1233, 'ordinary': 1234, 'switch': 1235, 'addition': 1236, 'refurbishment': 1237, 'emulsion': 1238, 'ceiling': 1239, 'tunel': 1240, 'abutment': 1241, 'stirrup': 1242, 'continued': 1243, 'watch': 1244, 'extraction': 1245, 'reporting': 1246, 'cia': 1247, 'stacker': 1248, 'reflux': 1249, 'gas': 1250, 'foliage': 1251, 'leucenas': 1252, 'adjutant': 1253, 'reference': 1254, 'attrition': 1255, 'dismantled': 1256, 'profile': 1257, 'security': 1258, 'fenced': 1259, 'stoppage': 1260, 'fright': 1261, 'sustaining': 1262, 'teams': 1263, 'purification': 1264, 'pressing': 1265, 'ajg': 1266, 'miss': 1267, 'gearbox': 1268, 'settle': 1269, 'month': 1270, 'interlace': 1271, 'rapid': 1272, 'bothering': 1273, 'blower': 1274, 'ambulance': 1275, 'flyght': 1276, 'lubrication': 1277, 'footdeep': 1278, 'success': 1279, 'shaped': 1280, 'like': 1281, 'cane': 1282, 'detaches': 1283, 'hoisting': 1284, 'bigbags': 1285, 'hoistings': 1286, 'lowvoltage': 1287, 'bigbag': 1288, 'cutoff': 1289, 'unloaded': 1290, 'visualizes': 1291, 'thrust': 1292, 'accumulate': 1293, 'dismount': 1294, 'visualize': 1295, 'shin': 1296, 'deceased': 1297, 'supervisory': 1298, 'infrastructure': 1299, 'julio': 1300, 
'toilet': 1301, 'bra': 1302, 'sickle': 1303, 'vine': 1304, 'liana': 1305, 'rhainer': 1306, 'object': 1307, 'thus': 1308, 'pasture': 1309, 'recently': 1310, 'residence': 1311, 'spatula': 1312, 'spear': 1313, 'windows': 1314, 'cluster': 1315, 'sleeper': 1316, 'submerge': 1317, 'loosens': 1318, 'cardan': 1319, 'connector': 1320, 'xcm': 1321, 'able': 1322, 'scaler': 1323, 'restricts': 1324, 'insertion': 1325, 'blind': 1326, 'wedges': 1327, 'hydroxide': 1328, 'disconnect': 1329, 'demineralization': 1330, 'sensor': 1331, 'comedor': 1332, 'lemon': 1333, 'voltage': 1334, 'outlet': 1335, 'cord': 1336, 'act': 1337, 'laboratory': 1338, 'coat': 1339, 'absorb': 1340, 'december': 1341, 'demister': 1342, 'cool': 1343, 'jaba': 1344, 'applies': 1345, 'cold': 1346, 'peel': 1347, 'chirodactile': 1348, 'attendant': 1349, 'compartment': 1350, 'classification': 1351, 'litter': 1352, 'assistants': 1353, 'disengage': 1354, 'needle': 1355, 'retraction': 1356, 'sulfur': 1357, 'dioxide': 1358, 'overpressure': 1359, 'cormei': 1360, 'eissa': 1361, 'cosapi': 1362, 'usual': 1363, 'unloading': 1364, 'bladder': 1365, 'charge': 1366, 'silo': 1367, 'delivery': 1368, 'surround': 1369, 'funnel': 1370, 'man': 1371, 'waterthinner': 1372, 'mixture': 1373, 'redness': 1374, 'burning': 1375, 'pvctype': 1376, 'wilder': 1377, 'gilton': 1378, 'introduces': 1379, 'imprisoning': 1380, 'workermechanic': 1381, 'toecap': 1382, 'tenth': 1383, 'cmxcm': 1384, 'zero': 1385, 'favor': 1386, 'deconcentrates': 1387, 'victaulica': 1388, 'copla': 1389, 'thermomagnetic': 1390, 'slaughter': 1391, 'choco': 1392, 'jka': 1393, 'promptly': 1394, 'outpatient': 1395, 'municipal': 1396, 'cruz': 1397, 'shipment': 1398, 'rigger': 1399, 'proceeded': 1400, 'connect': 1401, 'desanding': 1402, 'wagon': 1403, 'harden': 1404, 'stake': 1405, 'us': 1406, 'solubilization': 1407, 'chapel': 1408, 'vial': 1409, 'doser': 1410, 'jesus': 1411, 'shoot': 1412, 'realizes': 1413, 'passage': 1414, 'preuse': 1415, 'sip': 1416, 'noticing': 1417, 'enough': 
1418, 'esengrasante': 1419, 'machinery': 1420, 'toxicity': 1421, 'extract': 1422, 'vsd': 1423, 'lbs': 1424, 'freddy': 1425, 'pricked': 1426, 'future': 1427, 'portion': 1428, 'beetle': 1429, 'size': 1430, 'manifest': 1431, 'occurred': 1432, 'shirt': 1433, 'shield': 1434, 'localized': 1435, 'swelling': 1436, 'claudio': 1437, 'readjust': 1438, 'greater': 1439, 'torque': 1440, 'wilber': 1441, 'indexed': 1442, 'turntable': 1443, 'garit': 1444, 'chicrin': 1445, 'santa': 1446, 'informs': 1447, 'longer': 1448, 'complain': 1449, 'intense': 1450, 'lumbar': 1451, 'cite': 1452, 'overexertion': 1453, 'moth': 1454, 'sunglasses': 1455, 'marimbondos': 1456, 'drove': 1457, 'medicine': 1458, 'situation': 1459, 'also': 1460, 'good': 1461, 'lavras': 1462, 'sul': 1463, 'consult': 1464, 'ssomac': 1465, 'enmicadas': 1466, 'page': 1467, 'yolk': 1468, 'bapdd': 1469, 'poll': 1470, 'suitably': 1471, 'resulted': 1472, 'pickup': 1473, 'soft': 1474, 'igor': 1475, 'discover': 1476, 'photo': 1477, 'touch': 1478, 'period': 1479, 'continuously': 1480, 'identifies': 1481, 'hydrojet': 1482, 'obstruction': 1483, 'actuate': 1484, 'pedal': 1485, 'blown': 1486, 'shipper': 1487, 'anchorage': 1488, 'diamantina': 1489, 'xrd': 1490, 'bob': 1491, 'simultaneously': 1492, 'lack': 1493, 'luna': 1494, 'cruiser': 1495, 'pentacord': 1496, 'fanel': 1497, 'mark': 1498, 'breeder': 1499, 'measurement': 1500, 'fasten': 1501, 'tellomoinsac': 1502, 'vanishes': 1503, 'packaging': 1504, 'cylindrical': 1505, 'wellfield': 1506, 'tried': 1507, 'tree': 1508, 'underwent': 1509, 'moor': 1510, 'directs': 1511, 'gaze': 1512, 'sacrifice': 1513, 'porvenir': 1514, 'accretion': 1515, 'ustulacion': 1516, 'duct': 1517, 'disrupt': 1518, 'camp': 1519, 'laundry': 1520, 'erase': 1521, 'earthenware': 1522, 'alizado': 1523, 'tour': 1524, 'command': 1525, 'atriction': 1526, 'anterior': 1527, 'share': 1528, 'equally': 1529, 'falls': 1530, 'spoiler': 1531, 'kneel': 1532, 'warman': 1533, 'lxbb': 1534, 'victor': 1535, 'visual': 1536, 'alimakero': 
1537, 'cage': 1538, 'untie': 1539, 'spark': 1540, 'stope': 1541, 'corridor': 1542, 'reinstallation': 1543, 'tanker': 1544, 'north': 1545, 'skid': 1546, 'defensive': 1547, 'fulcrum': 1548, 'traumatism': 1549, 'trailer': 1550, 'shutter': 1551, 'aramid': 1552, 'testimony': 1553, 'bonsucesso': 1554, 'research': 1555, 'geosol': 1556, 'trestle': 1557, 'lucas': 1558, 'ltda': 1559, 'visit': 1560, 'juveni': 1561, 'dizziness': 1562, 'faintness': 1563, 'concussion': 1564, 'electrometallurgy': 1565, 'code': 1566, 'ele': 1567, 'abb': 1568, 'referred': 1569, 'bioxide': 1570, 'spent': 1571, 'fissure': 1572, 'subsequently': 1573, 'volvo': 1574, 'oxicorte': 1575, 'solder': 1576, 'dosage': 1577, 'centralizer': 1578, 'facilitate': 1579, 'accelerate': 1580, 'tightens': 1581, 'ddh': 1582, 'explomin': 1583, 'socorro': 1584, 'drillerwas': 1585, 'rotates': 1586, 'ensure': 1587, 'afo': 1588, 'bonifacio': 1589, 'robot': 1590, 'emptiness': 1591, 'enoc': 1592, 'sensation': 1593, 'correctly': 1594, 'placing': 1595, 'rack': 1596, 'fabric': 1597, 'mixer': 1598, 'lights': 1599, 'drum': 1600, 'displaces': 1601, 'location': 1602, 'technical': 1603, 'pause': 1604, 'know': 1605, 'weed': 1606, 'communication': 1607, 'railway': 1608, 'reposition': 1609, 'stability': 1610, 'frank': 1611, 'maintain': 1612, 'overlap': 1613, 'nonsustained': 1614, 'blaster': 1615, 'cutblunt': 1616, 'mat': 1617, 'wet': 1618, 'slippery': 1619, 'brushcutters': 1620, 'average': 1621, 'ajani': 1622, 'liliana': 1623, 'prepares': 1624, 'folder': 1625, 'iglu': 1626, 'diagonally': 1627, 'inward': 1628, 'stabilizes': 1629, 'portable': 1630, 'hang': 1631, 'hinge': 1632, 'closing': 1633, 'grind': 1634, 'triangular': 1635, 'measure': 1636, 'tranquera': 1637, 'raspandose': 1638, 'thorax': 1639, 'denis': 1640, 'imbalance': 1641, 'manipulation': 1642, 'propiciandose': 1643, 'powder': 1644, 'excessive': 1645, 'sprain': 1646, 'unbalancing': 1647, 'twisting': 1648, 'eriks': 1649, 'tecl': 1650, 'inchancable': 1651, 'mag': 1652, 'murilo': 1653, 
'acquisition': 1654, 'gap': 1655, 'traverse': 1656, 'ravine': 1657, 'xray': 1658, 'examination': 1659, 'physician': 1660, 'serious': 1661, 'transverse': 1662, 'confipetrol': 1663, 'reduction': 1664, 'progress': 1665, 'secured': 1666, 'vms': 1667, 'xixas': 1668, 'swarm': 1669, 'play': 1670, 'visibility': 1671, 'hiss': 1672, 'rip': 1673, 'tangled': 1674, 'stopper': 1675, 'mortar': 1676, 'improve': 1677, 'bricklayer': 1678, 'per': 1679, 'occurs': 1680, 'personal': 1681, 'dimension': 1682, 'tabola': 1683, 'woman': 1684, 'tap': 1685, 'cinnamon': 1686, 'francisco': 1687, 'lectrowelded': 1688, 'eyebrow': 1689, 'warrin': 1690, 'crack': 1691, 'inlet': 1692, 'laminator': 1693, 'stabilizer': 1694, 'hiab': 1695, 'winery': 1696, 'grinder': 1697, 'adapt': 1698, 'crosscutter': 1699, 'traumatic': 1700, 'amputation': 1701, 'killer': 1702, 'manitou': 1703, 'inform': 1704, 'scruber': 1705, 'fee': 1706, 'georli': 1707, 'tqs': 1708, 'manuel': 1709, 'disconnection': 1710, 'manco': 1711, 'cajamarquilla': 1712, 'yield': 1713, 'eustaquio': 1714, 'luxofractures': 1715, 'leaning': 1716, 'spend': 1717, 'taque': 1718, 'pendulum': 1719, 'deslaminator': 1720, 'detect': 1721, 'manipulator': 1722, 'neutral': 1723, 'leach': 1724, 'airlift': 1725, 'hitchhike': 1726, 'canterio': 1727, 'opposite': 1728, 'spin': 1729, 'containment': 1730, 'basin': 1731, 'formation': 1732, 'stuck': 1733, 'pointed': 1734, 'brjcldd': 1735, 'servitecforaco': 1736, 'july': 1737, 'josimar': 1738, 'fish': 1739, 'jack': 1740, 'entire': 1741, 'effective': 1742, 'inefficacy': 1743, 'do': 1744, 'chestnut': 1745, 'monkey': 1746, 'faucet': 1747, 'firmly': 1748, 'composition': 1749, 'stitch': 1750, 'cathodic': 1751, 'digger': 1752, 'doosan': 1753, 'suspender': 1754, 'suffered': 1755, 'illness': 1756, 'headlight': 1757, 'defective': 1758, 'formerly': 1759, 'yard': 1760, 'pushed': 1761, 'courier': 1762, 'heading': 1763, 'pique': 1764, 'electricians': 1765, 'paralyzed': 1766, 'half': 1767, 'trainee': 1768, 'planamieto': 1769, 
'notebook': 1770, 'difficult': 1771, 'winemaker': 1772, 'registered': 1773, 'yaranga': 1774, 'juan': 1775, 'react': 1776, 'manage': 1777, 'fuse': 1778, 'ith': 1779, 'pay': 1780, 'answer': 1781, 'distract': 1782, 'even': 1783, 'swing': 1784, 'slimming': 1785, 'kiln': 1786, 'crucible': 1787, 'albertico': 1788, 'jhony': 1789, 'cockpit': 1790, 'launcher': 1791, 'noticed': 1792, 'hastial': 1793, 'labor': 1794, 'utensil': 1795, 'stir': 1796, 'cooker': 1797, 'hidalgo': 1798, 'unstable': 1799, 'reel': 1800, 'frontally': 1801, 'replace': 1802, 'expansion': 1803, 'chemo': 1804, 'conclusion': 1805, 'amp': 1806, 'hatch': 1807, 'measuring': 1808, 'surcharge': 1809, 'ship': 1810, 'hemiface': 1811, 'reinforce': 1812, 'deepen': 1813, 'distant': 1814, 'atenuz': 1815, 'excavator': 1816, 'baton': 1817, 'eusebio': 1818, 'fail': 1819, 'lubricate': 1820, 'alfredo': 1821, 'spillway': 1822, 'absorbent': 1823, 'ax': 1824, 'compressor': 1825, 'bonnet': 1826, 'function': 1827, 'rag': 1828, 'ompressor': 1829, 'stroke': 1830, 'jibs': 1831, 'jib': 1832, 'enforce': 1833, 'reception': 1834, 'making': 1835, 'keypad': 1836, 'manipulates': 1837, 'expedition': 1838, 'overall': 1839, 'hauling': 1840, 'motorist': 1841, 'rid': 1842, 'saddle': 1843, 'weakly': 1844, 'pipeline': 1845, 'seatbelt': 1846, 'accompanied': 1847, 'path': 1848, 'deslaminadora': 1849, 'mollares': 1850, 'brushed': 1851, 'mollaress': 1852, 'mince': 1853, 'juina': 1854, 'blackjack': 1855, 'manifestation': 1856, 'afternoon': 1857, 'habilitation': 1858, 'kitchen': 1859, 'specific': 1860, 'ditch': 1861, 'estimate': 1862, 'dune': 1863, 'sunday': 1864, 'ago': 1865, 'exchanger': 1866, 'define': 1867, 'risk': 1868, 'sulfate': 1869, 'conchucos': 1870, 'ancash': 1871, 'patronal': 1872, 'feast': 1873, 'represent': 1874, 'ceremony': 1875, 'fruit': 1876, 'toys': 1877, 'public': 1878, 'pyrotechnics': 1879, 'gift': 1880, 'frighten': 1881, 'kick': 1882, 'efrain': 1883, 'osorio': 1884, 'felix': 1885, 'mina': 1886, 'compressed': 1887, 'nozzle': 1888, 
'lung': 1889, 'violent': 1890, 'stun': 1891, 'ball': 1892, 'hump': 1893, 'hill': 1894, 'aforementioned': 1895, 'descended': 1896, 'mudswathed': 1897, 'released': 1898, 'activates': 1899, 'walter': 1900, 'driven': 1901, 'request': 1902, 'data': 1903, 'congestion': 1904, 'fender': 1905, 'elevation': 1906, 'aerial': 1907, 'thinner': 1908, 'flammable': 1909, 'die': 1910, 'pead': 1911, 'geomembrane': 1912, 'blanket': 1913, 'seam': 1914, 'extruder': 1915, 'stylet': 1916, 'soldering': 1917, 'insulation': 1918, 'backhoe': 1919, 'moon': 1920, 'facial': 1921, 'maestranza': 1922, 'operating': 1923, 'bench': 1924, 'skip': 1925, 'verifies': 1926, 'everything': 1927, 'apparent': 1928, 'paste': 1929, 'vacuum': 1930, 'keep': 1931, 'void': 1932, 'warley': 1933, 'workplace': 1934, 'disposal': 1935, 'ammonia': 1936, 'refrigerant': 1937, 'topographic': 1938, 'west': 1939, 'sccop': 1940, 'mini': 1941, 'adapter': 1942, 'slab': 1943, 'lodged': 1944, 'tightening': 1945, 'neglected': 1946, 'rotary': 1947, 'breaking': 1948, 'particles': 1949, 'laceration': 1950, 'seven': 1951, 'thorns': 1952, 'copper': 1953, 'repulping': 1954, 'vision': 1955, 'segment': 1956, 'polyurethane': 1957, 'hip': 1958, 'revegetation': 1959, 'mallet': 1960, 'fisherman': 1961, 'whiplash': 1962, 'hycrontype': 1963, 'tractor': 1964, 'radiator': 1965, 'carpentry': 1966, 'diagnosis': 1967, 'conductive': 1968, 'rig': 1969, 'potion': 1970, 'localize': 1971, 'subsequent': 1972, 'silver': 1973, 'afterwards': 1974, 'latter': 1975, 'concreting': 1976, 'inferior': 1977, 'washing': 1978, 'fully': 1979, 'electrolyte': 1980, 'believe': 1981, 'nailed': 1982, 'properly': 1983, 'thorn': 1984, 'disassembled': 1985, 'profiles': 1986, 'paralysis': 1987, 'scare': 1988, 'caused': 1989, 'within': 1990, 'flat': 1991, 'pedestal': 1992, 'plan': 1993, 'shockbearing': 1994, 'stepladder': 1995, 'strength': 1996, 'eyelash': 1997, 'foundry': 1998, 'package': 1999, 'footwear': 2000, 'inclination': 2001, 'schedule': 2002, 'almost': 2003, 'harness': 
2004, 'tailing': 2005, 'spare': 2006, 'violently': 2007, 'accidently': 2008, 'maximum': 2009, 'flexible': 2010, 'shot': 2011, 'deteriorate': 2012, 'blasting': 2013, 'fog': 2014, 'chooses': 2015, 'comfort': 2016, 'thrown': 2017, 'combination': 2018, 'junior': 2019, 'disassembly': 2020, 'pulpomatic': 2021, 'rivet': 2022, 'horizontal': 2023, 'shotcreteados': 2024, 'verification': 2025, 'confirm': 2026, 'coordination': 2027, 'downwards': 2028, 'gauge': 2029, 'unleashing': 2030, 'saturate': 2031, 'crest': 2032, 'rugged': 2033, 'crumbles': 2034, 'cheek': 2035, 'isolate': 2036, 'corresponding': 2037, 'link': 2038, 'shorten': 2039, 'injection': 2040, 'resin': 2041, 'band': 2042, 'uneven': 2043, 'distribution': 2044, 'upwards': 2045, 'timely': 2046, 'earth': 2047, 'cubic': 2048, 'minute': 2049, 'allow': 2050, 'adhesion': 2051, 'assume': 2052, 'response': 2053, 'death': 2054, 'investigation': 2055, 'poncho': 2056, 'plat': 2057, 'performer': 2058, 'lime': 2059, 'reactive': 2060, 'legs': 2061, 'upward': 2062, 'willing': 2063, 'displace': 2064, 'adhere': 2065, 'knuckle': 2066, 'gloves': 2067, 'dish': 2068, 'become': 2069, 'unbalanced': 2070, 'despite': 2071, 'indicated': 2072, 'spatter': 2073, 'hood': 2074, 'crew': 2075, 'subjection': 2076, 'achieve': 2077, 'cycle': 2078, 'uncover': 2079, 'prils': 2080, 'enable': 2081, 'importance': 2082, 'derive': 2083, 'unbalance': 2084, 'pillar': 2085, 'specify': 2086, 'figure': 2087, 'geology': 2088, 'temporarily': 2089, 'bos': 2090, 'soquet': 2091, 'hexagonal': 2092, 'worn': 2093, 'anticlockwise': 2094, 'gallon': 2095, 'derail': 2096, 'leathertype': 2097, 'debris': 2098, 'muscle': 2099, 'marking': 2100, 'scalp': 2101, 'laterally': 2102, 'instruct': 2103, 'monitoring': 2104, 'existence': 2105, 'tighten': 2106, 'eyebolt': 2107, 'dd': 2108, 'locker': 2109, 'misalignment': 2110, 'scraper': 2111, 'activation': 2112, 'pip': 2113, 'uncoupled': 2114, 'sulfide': 2115, 'ambulatory': 2116, 'gram': 2117, 'liter': 2118, 'overcome': 2119, 'resistance': 
2120, 'lineman': 2121, 'reshape': 2122, 'beating': 2123, 'perceive': 2124, 'grabs': 2125, 'concentrator': 2126, 'flotation': 2127, 'chair': 2128, 'filtration': 2129, 'motion': 2130, 'expose': 2131, 'draw': 2132, 'jet': 2133, 'preventive': 2134, 'roller': 2135, 'warp': 2136, 'proximal': 2137, 'review': 2138, 'curve': 2139, 'unevenness': 2140, 'overturn': 2141, 'scrap': 2142, 'explosion': 2143, 'straight': 2144, 'progressive': 2145, 'temporary': 2146, 'burst': 2147, 'night': 2148, 'none': 2149, 'damage': 2150, 'pre': 2151, 'estriping': 2152, 'tranfer': 2153, 'exerts': 2154, 'former': 2155, 'primary': 2156, 'vieira': 2157, 'auxiliaries': 2158, 'diassis': 2159, 'consultation': 2160, 'diagnose': 2161, 'prescribe': 2162, 'remedy': 2163, 'ice': 2164, 'pack': 2165, 'evaporator': 2166, 'slag': 2167, 'pear': 2168, 'superciliary': 2169, 'divert': 2170, 'diversion': 2171, 'thug': 2172, 'agitate': 2173, 'pant': 2174, 'fiberglass': 2175, 'marble': 2176, 'breaker': 2177, 'bumped': 2178, 'element': 2179, 'polymer': 2180, 'heard': 2181, 'crushing': 2182, 'startup': 2183, 'projecting': 2184}
len(tokenizer.word_index)
2184
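The index printed above is how Keras' `Tokenizer` assigns ids: words are ranked by descending frequency and numbered from 1, so that id 0 stays free for padding. A tiny pure-Python sketch of that scheme (toy sentences, not the real corpus):

```python
from collections import Counter

def build_word_index(texts):
    """Keras-Tokenizer-style index: words ranked by frequency, ids start at 1
    (0 is reserved for the padding token)."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

idx = build_word_index(["pulley fall injury", "pulley rotate"])
print(idx)  # {'pulley': 1, 'fall': 2, 'injury': 3, 'rotate': 4}
```

The most frequent word gets the smallest id, which is why common corpus words appear early in the dump above.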
# Printing a sample feature before tokenization
print(X_train[10])
instal segment polyurethane pulley protective lyner xxcm weigh head pulley ore winch pulley rotate compress lyner inside channel fall housing rub right side hip generate injury describe
X_train = tokenizer.texts_to_sequences(X_train.tolist())
# Printing the same feature after tokenization
print(X_train[10])
[828, 55, 125, 2, 829, 830, 46, 23, 2, 105, 829, 830, 47, 194, 1, 88, 133, 3, 2]
X_test = tokenizer.texts_to_sequences(X_test.tolist())
# Define maximum number of words to consider in each text
maxlen = max_description_len
# Pad training text
X_train = pad_sequences(X_train, maxlen= maxlen, padding='pre', truncating='post')
# Pad testing text
X_test = pad_sequences(X_test, maxlen= maxlen, padding='pre', truncating='post')
print(X_train.shape)
(382, 85)
# Printing the same example after padding
print(X_train[10])
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 828 55 125 2 829 830 46 23 2 105 829 830 47 194 1 88 133 3 2]
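The padded vector above shows the effect of `padding='pre'` (zeros prepended up to `maxlen`) and `truncating='post'` (words beyond `maxlen` dropped from the end). A minimal pure-Python sketch of what `pad_sequences` does with those settings:

```python
def pad_pre_truncate_post(seqs, maxlen, value=0):
    """Mimic Keras pad_sequences with padding='pre' and truncating='post'."""
    padded = []
    for seq in seqs:
        seq = seq[:maxlen]                                   # truncate from the end ('post')
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front ('pre')
    return padded

print(pad_pre_truncate_post([[5, 6, 7], [1, 2, 3, 4, 5, 6]], maxlen=5))
# [[0, 0, 5, 6, 7], [1, 2, 3, 4, 5]]
```

Pre-padding keeps the informative tokens at the end of each sequence, which is generally preferred when a recurrent layer reads left to right.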
# Installing Gensim
!pip install gensim
Requirement already satisfied: gensim in c:\users\dsjohn\anaconda3\lib\site-packages (4.1.2) Requirement already satisfied: Cython==0.29.23 in c:\users\dsjohn\anaconda3\lib\site-packages (from gensim) (0.29.23) Requirement already satisfied: scipy>=0.18.1 in c:\users\dsjohn\anaconda3\lib\site-packages (from gensim) (1.6.2) Requirement already satisfied: smart-open>=1.8.1 in c:\users\dsjohn\anaconda3\lib\site-packages (from gensim) (5.2.1) Requirement already satisfied: numpy>=1.17.0 in c:\users\dsjohn\anaconda3\lib\site-packages (from gensim) (1.19.5)
import gensim
import gensim.downloader as api
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import Word2Vec, KeyedVectors
# GloVe file - we are using the model with 200-dimensional embeddings
glove_input_file = working_dir + 'glove.6B.200d.txt'
# Name for the converted word2vec file
word2vec_output_file = working_dir + 'glove.6B.200d.txt.word2vec'
# Converting the GloVe embeddings to word2vec format
glove2word2vec(glove_input_file, word2vec_output_file)
(400000, 200)
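The `(400000, 200)` return value is the header that `glove2word2vec` prepends: the word2vec text format is simply the GloVe text format with a leading "vocab_size dimensions" line. A minimal sketch of that conversion on toy lines (not the real GloVe file):

```python
def glove_to_word2vec_lines(glove_lines):
    """The word2vec text format is the GloVe format plus a 'count dim' header line."""
    dim = len(glove_lines[0].split()) - 1   # tokens per line minus the word itself
    return [f"{len(glove_lines)} {dim}"] + glove_lines

lines = glove_to_word2vec_lines(["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"])
print(lines[0])  # "2 3"
```

With the header in place, `KeyedVectors.load_word2vec_format` can read the file directly, as the next cell does.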
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
# Checking the most similar words in the GloVe model
glove_model.most_similar('polyurethane')
[('varnish', 0.6505525708198547),
('urethane', 0.6350276470184326),
('fiberglass', 0.5904651880264282),
('neoprene', 0.5883365273475647),
('latex', 0.5864425897598267),
('epoxy', 0.5721665024757385),
('thermoplastic', 0.5650182962417603),
('silicone', 0.5631036758422852),
('acrylic', 0.551384449005127),
('foam', 0.5452547073364258)]
# Size of glove model
glove_model.vectors.shape
(400000, 200)
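`most_similar` ranks candidate words by the cosine similarity between their vectors and the query vector, which is why the scores above fall between -1 and 1. A numpy sketch of that metric on toy 3-d vectors (stand-ins, not real GloVe weights):

```python
import numpy as np

def cosine_sim(a, b):
    # most_similar ranks candidates by this score against the query vector
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

poly = np.array([1.0, 2.0, 0.5])
urethane = np.array([0.9, 2.1, 0.4])
banana = np.array([-2.0, 0.1, 3.0])

print(cosine_sim(poly, urethane) > cosine_sim(poly, banana))  # True
```

Vectors pointing in similar directions score near 1, which is what makes 'urethane' and 'varnish' close neighbors of 'polyurethane' in the output above.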
# Dimensionality of the pre-trained embeddings
embedding_vector_length= glove_model.vector_size
embedding_vector_length
200
vocab_size = len(tokenizer.word_index)+1
vocab_size
2185
num_words = min(max_features, vocab_size)
num_words
2185
# Initialize the embedding matrix with 2185 rows (vocab_size, which includes
# 1 extra row for the padding index 0) and 200 columns (the embedding size)
embedding_matrix = np.zeros((num_words, embedding_vector_length))
embedding_matrix.shape
(2185, 200)
# Loading the GloVe vector for each word in our vocabulary
for word, i in sorted(tokenizer.word_index.items(), key=lambda x: x[1]):
    if i >= num_words:
        break
    try:
        # Read this word's embedding from the GloVe model
        embedding_matrix[i] = glove_model[word]
    except KeyError:
        # Words absent from GloVe keep their zero vector
        pass
# Embedding matrix shape
embedding_matrix.shape
(2185, 200)
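Since out-of-GloVe words keep their zero vectors, it is worth checking how much of the vocabulary actually received a pretrained embedding. A toy sketch of that coverage check (a 5-row stand-in for the real (2185, 200) matrix):

```python
import numpy as np

# Toy stand-in for the embedding matrix: rows left at zero correspond to
# vocabulary words missing from GloVe (plus row 0, the padding row).
emb = np.zeros((5, 4))
emb[1] = [0.1, 0.2, 0.3, 0.4]
emb[3] = [0.5, 0.1, 0.0, 0.2]

covered = np.count_nonzero(np.any(emb != 0, axis=1))
print(f"{covered}/{emb.shape[0] - 1} non-padding rows have a pretrained vector")
```

Running the same check on the real `embedding_matrix` would show how many of the 2184 vocabulary words (some Spanish/Portuguese terms and proper names in the dump above) GloVe actually covers.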
# Random visualization after embedding
num = np.random.randint(0, embedding_matrix.shape[0])
print(embedding_matrix[num])
[-0.32326999 -0.36675 -0.13185 -0.56379002 -0.42116001 0.20433 -0.54413003 0.20637999 0.04524 0.43414 0.57060999 0.16978 -0.18277 -0.35747999 0.49268001 0.25354999 -0.45715001 0.51020998 0.24029 -0.31540999 0.3635 2.0545001 0.18916 -0.38025999 -0.16582 0.25003999 0.17705999 -0.29743999 -1.12960005 -0.24127001 -0.058341 0.087287 -0.17026 0.50075001 -0.080768 0.46362001 0.088897 -0.36934999 -0.33917999 0.46029001 -0.34512001 0.62216002 0.26919001 1.01170003 -0.23574001 -0.10663 -0.33114001 0.36204001 -0.065802 0.80120999 -0.15741 0.26693001 -0.20163999 -0.20665 0.70344001 -0.42394999 -0.018581 -0.071598 0.067712 0.24241 -0.54751998 0.16033 0.13736001 -0.39471999 -0.46085 -0.066823 0.071499 -0.47626001 0.97737002 -0.41894999 0.26176 -0.32587001 -0.6239 0.36346999 0.54324001 -0.71047002 -0.31488001 0.025794 -0.45247 -0.19077 0.20071 0.2832 0.18615 -0.47602999 0.036803 -0.13618 0.57489002 -0.68414003 1.11889994 -0.94349998 0.80093998 0.01431 -0.48872 -0.25681999 0.16804001 -0.41148999 -0.19047 -0.22786 -0.31057999 0.54718 0.75384998 -0.077524 -0.28849 0.35010999 -0.21080001 -0.46647 0.15561999 1.04079998 -0.10765 -0.58311999 -0.012608 -0.42991999 0.38339999 -0.58789998 0.54523998 -0.56431001 -0.37788999 0.44811001 -0.59954 0.072765 0.11047 -0.60029 -0.093856 1.27289999 0.051302 0.19742 -0.33338001 -0.12001 0.066616 0.013741 -0.26350999 -0.030146 0.16144 0.34333 -0.21758001 -0.048324 -0.26376 -0.46373999 0.10281 -0.35335001 0.37305 0.05287 0.24235 -0.11087 0.67831999 0.12819 -0.50506997 -0.55405998 0.49245 0.16103999 0.10553 0.22854 -0.89528 -0.35078001 0.12078 0.28422999 0.19175 0.25856 -0.92390001 -0.68254 0.71530998 -0.33204001 0.28531 -0.35808 0.36899 -0.28657001 0.15666001 0.13123 -0.029004 -0.85044003 -0.40516999 0.29131001 0.030412 -0.11627 0.054201 0.26908001 -0.64955002 -0.25652 0.030637 -0.0067978 0.63922 -0.094149 0.33203 0.41696 0.25716999 -1.22140002 -0.25990999 -0.37336999 -0.070481 0.40156999 -0.33206999 -0.26337999 0.42885 -0.05889 -0.34599 0.1208 
-0.085774 0.4808 -0.28022999 0.13212 ]
# Initializing the model
clear_session()
nn_model = Sequential()
# Embedding layer
nn_model.add(Embedding(input_dim=num_words, output_dim=embedding_vector_length,
                       weights=[embedding_matrix],
                       trainable=False,
                       input_length=maxlen))
nn_model.output
<tf.Tensor 'embedding/embedding_lookup/Identity_1:0' shape=(None, 85, 200) dtype=float32>
The Embedding layer gives a 3D output -> [Batch_Size, Sequence_Length, Embedding_Size]
# Flatten the data as will use Dense layer
nn_model.add(Flatten())
# Adding Hidden Layers(Dense layers)
nn_model.add(Dense(100, activation='relu'))
nn_model.add(Dropout(0.4))
nn_model.add(BatchNormalization())
nn_model.add(Dense(50, activation='relu'))
nn_model.add(Dropout(0.4))
nn_model.add(BatchNormalization())
nn_model.add(Dense(25, activation='relu'))
nn_model.add(Dropout(0.4))
# Adding output layer
nn_model.add(Dense(5, activation='softmax'))
nn_model.output
<tf.Tensor 'dense_3/Softmax:0' shape=(None, 5) dtype=float32>
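The final softmax turns the 5 output logits into a probability distribution over the accident levels; the predicted level is the class with the highest probability. A numpy sketch of that activation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1, 0.1, 0.1]))
print(p.argmax(), round(float(p.sum()), 6))  # highest-logit class wins; probabilities sum to 1
```

This is why the evaluation cells further down take `argmax(axis=1)` over the model's predictions.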
# Compiling the model
nn_model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
nn_model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 85, 200) 437000 _________________________________________________________________ flatten (Flatten) (None, 17000) 0 _________________________________________________________________ dense (Dense) (None, 100) 1700100 _________________________________________________________________ dropout (Dropout) (None, 100) 0 _________________________________________________________________ batch_normalization (BatchNo (None, 100) 400 _________________________________________________________________ dense_1 (Dense) (None, 50) 5050 _________________________________________________________________ dropout_1 (Dropout) (None, 50) 0 _________________________________________________________________ batch_normalization_1 (Batch (None, 50) 200 _________________________________________________________________ dense_2 (Dense) (None, 25) 1275 _________________________________________________________________ dropout_2 (Dropout) (None, 25) 0 _________________________________________________________________ dense_3 (Dense) (None, 5) 130 ================================================================= Total params: 2,144,155 Trainable params: 1,706,855 Non-trainable params: 437,300 _________________________________________________________________
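The parameter counts in the summary can be checked by hand: Flatten turns the (85, 200) embedding output into 17,000 features, and the first Dense layer then needs one weight per feature per unit plus a bias. A small numpy sketch of that arithmetic (toy batch of 2):

```python
import numpy as np

batch, seq_len, emb_dim = 2, 85, 200
x = np.zeros((batch, seq_len, emb_dim))   # shape of the Embedding output
flat = x.reshape(batch, -1)               # what Flatten does
print(flat.shape)                         # (2, 17000): 85 * 200 features
# First Dense layer: 17,000 weights per unit + 1 bias, for 100 units
print(17000 * 100 + 100)                  # 1,700,100, matching the summary
```

Most of the model's trainable parameters therefore sit in that first Dense layer, which is a lot of capacity for only 382 training examples.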
# Using callbacks to stop training when the validation loss stops improving
early = EarlyStopping(monitor='val_loss', patience=5, verbose=1, min_delta=0.0001, mode='auto')
reduce_learning = ReduceLROnPlateau(patience=5, verbose=1, min_lr=1e-6, factor=0.2)
# model_cp = ModelCheckpoint('Industrial_chatbot.h5',monitor='val_loss', save_best_only= True, mode= 'min', verbose=1)
callback_list = [early, reduce_learning]
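EarlyStopping counts epochs since the last improvement of at least `min_delta` and stops once that count reaches `patience`. A minimal pure-Python sketch of that patience logic (simplified; the Keras implementation also handles `mode`, baselines, and weight restoration):

```python
def early_stop_epoch(val_losses, patience, min_delta=1e-4):
    """Return the 1-based epoch at which training would stop, else None."""
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Loss improves for two epochs, then stalls; with patience=3, training
# stops three epochs after the last improvement.
print(early_stop_epoch([1.00, 0.90, 0.91, 0.92, 0.93], patience=3))  # 5
```

This matches the run below, where training halts a few epochs after the validation loss bottoms out.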
nn_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=32, callbacks=callback_list)
Epoch 1/100 12/12 [==============================] - 0s 35ms/step - loss: 2.1779 - accuracy: 0.1937 - val_loss: 1.6959 - val_accuracy: 0.1395 Epoch 2/100 12/12 [==============================] - 0s 13ms/step - loss: 1.8504 - accuracy: 0.2801 - val_loss: 1.6548 - val_accuracy: 0.2093 Epoch 3/100 12/12 [==============================] - 0s 14ms/step - loss: 1.7493 - accuracy: 0.2775 - val_loss: 1.6127 - val_accuracy: 0.2093 Epoch 4/100 12/12 [==============================] - 0s 14ms/step - loss: 1.6123 - accuracy: 0.3272 - val_loss: 1.5720 - val_accuracy: 0.2791 Epoch 5/100 12/12 [==============================] - 0s 17ms/step - loss: 1.5628 - accuracy: 0.3325 - val_loss: 1.5401 - val_accuracy: 0.3023 Epoch 6/100 12/12 [==============================] - 0s 15ms/step - loss: 1.4556 - accuracy: 0.3796 - val_loss: 1.5022 - val_accuracy: 0.3256 Epoch 7/100 12/12 [==============================] - 0s 15ms/step - loss: 1.4471 - accuracy: 0.4319 - val_loss: 1.4893 - val_accuracy: 0.3023 Epoch 8/100 12/12 [==============================] - 0s 13ms/step - loss: 1.4115 - accuracy: 0.4136 - val_loss: 1.4716 - val_accuracy: 0.3488 Epoch 9/100 12/12 [==============================] - 0s 14ms/step - loss: 1.3703 - accuracy: 0.4372 - val_loss: 1.4190 - val_accuracy: 0.3953 Epoch 10/100 12/12 [==============================] - 0s 14ms/step - loss: 1.3261 - accuracy: 0.4503 - val_loss: 1.3947 - val_accuracy: 0.4419 Epoch 11/100 12/12 [==============================] - 0s 13ms/step - loss: 1.3285 - accuracy: 0.4529 - val_loss: 1.3743 - val_accuracy: 0.4186 Epoch 12/100 12/12 [==============================] - 0s 13ms/step - loss: 1.3131 - accuracy: 0.4555 - val_loss: 1.3564 - val_accuracy: 0.4419 Epoch 13/100 12/12 [==============================] - 0s 14ms/step - loss: 1.1798 - accuracy: 0.5079 - val_loss: 1.3489 - val_accuracy: 0.3953 Epoch 14/100 12/12 [==============================] - 0s 15ms/step - loss: 1.1801 - accuracy: 0.5052 - val_loss: 1.3421 - val_accuracy: 0.3721 Epoch 
15/100 12/12 [==============================] - 0s 16ms/step - loss: 1.1591 - accuracy: 0.5262 - val_loss: 1.3348 - val_accuracy: 0.4651 Epoch 16/100 12/12 [==============================] - 0s 18ms/step - loss: 1.1087 - accuracy: 0.5628 - val_loss: 1.3269 - val_accuracy: 0.4186 Epoch 17/100 12/12 [==============================] - 0s 17ms/step - loss: 1.0700 - accuracy: 0.5785 - val_loss: 1.3221 - val_accuracy: 0.4186 Epoch 18/100 12/12 [==============================] - 0s 15ms/step - loss: 1.0208 - accuracy: 0.5995 - val_loss: 1.3240 - val_accuracy: 0.3953 Epoch 19/100 12/12 [==============================] - 0s 14ms/step - loss: 0.9849 - accuracy: 0.6047 - val_loss: 1.3343 - val_accuracy: 0.3953 Epoch 20/100 12/12 [==============================] - 0s 14ms/step - loss: 0.8848 - accuracy: 0.6806 - val_loss: 1.3345 - val_accuracy: 0.3953 Epoch 21/100 12/12 [==============================] - 0s 19ms/step - loss: 0.8925 - accuracy: 0.6440 - val_loss: 1.3289 - val_accuracy: 0.3488 Epoch 22/100 12/12 [==============================] - 0s 20ms/step - loss: 0.8362 - accuracy: 0.6859 - val_loss: 1.3173 - val_accuracy: 0.3953 Epoch 23/100 12/12 [==============================] - 0s 19ms/step - loss: 0.8207 - accuracy: 0.7068 - val_loss: 1.3180 - val_accuracy: 0.3488 Epoch 24/100 12/12 [==============================] - 0s 14ms/step - loss: 0.8153 - accuracy: 0.6963 - val_loss: 1.3232 - val_accuracy: 0.3488 Epoch 25/100 12/12 [==============================] - 0s 14ms/step - loss: 0.7829 - accuracy: 0.7251 - val_loss: 1.3246 - val_accuracy: 0.3488 Epoch 26/100 12/12 [==============================] - 0s 13ms/step - loss: 0.7095 - accuracy: 0.7539 - val_loss: 1.3291 - val_accuracy: 0.4186 Epoch 27/100 10/12 [========================>.....] - ETA: 0s - loss: 0.6817 - accuracy: 0.7656 Epoch 00027: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026. 
12/12 [==============================] - 0s 13ms/step - loss: 0.6865 - accuracy: 0.7487 - val_loss: 1.3317 - val_accuracy: 0.3953 Epoch 00027: early stopping
<tensorflow.python.keras.callbacks.History at 0x2795e167130>
# Printing the performance matrix
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
def print_confusion_matrix(y_test, ytest_predict):
    cm = confusion_matrix(y_test, ytest_predict)
    cm = pd.DataFrame(cm)
    plt.figure(figsize=(4, 4))
    sns.set()
    sns.heatmap(cm.T, square=True, fmt='', annot=True, cbar=False, cmap='plasma',
                xticklabels=['1', '2', '3', '4', '5'],
                yticklabels=['1', '2', '3', '4', '5']).set_title('Confusion Matrix')
    plt.xlabel('True label')
    plt.ylabel('Predicted label')
    plt.show()
ytest_predict = nn_model.predict(X_test)
# The class with the highest softmax probability is taken as the prediction (argmax below)
print_confusion_matrix(y_test.argmax(axis=1), ytest_predict.argmax(axis=1))
print(classification_report(y_test.argmax(axis=1), ytest_predict.argmax(axis=1), target_names=['1','2','3','4','5']))
precision recall f1-score support
1 1.00 0.14 0.25 7
2 0.27 0.23 0.25 13
3 0.10 0.17 0.12 6
4 0.57 0.75 0.65 16
5 0.00 0.00 0.00 1
accuracy 0.40 43
macro avg 0.39 0.26 0.25 43
weighted avg 0.47 0.40 0.38 43
The feed-forward Neural Network model is not learning well: test accuracy is only 40%, and recall on the minority classes is poor.
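The per-class numbers in the report follow the standard precision/recall/F1 definitions. A sketch reproducing the class-4 row, where the TP/FP/FN counts are inferred from the rounded figures in the report (support = 16), so they are an assumption rather than read from the confusion matrix:

```python
# Counts assumed consistent with the class-4 row: recall 0.75 on support 16
# implies 12 true positives and 4 false negatives; precision 0.57 implies
# roughly 9 false positives.
tp, fn, fp = 12, 4, 9

precision = tp / (tp + fp)                          # 12/21 ≈ 0.57
recall = tp / (tp + fn)                             # 12/16 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.65
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The large gap between the macro average (0.25 F1) and the weighted average (0.38 F1) reflects the class imbalance: the model does reasonably on the frequent level 4 but fails on the rare severe levels.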
# Initializing the model
clear_session()
LSTM_model = Sequential()
# Embedding layer
LSTM_model.add(Embedding(input_dim=num_words, output_dim=embedding_vector_length,
                         weights=[embedding_matrix],
                         trainable=False,
                         input_length=maxlen))
LSTM_model.output
<tf.Tensor 'embedding/embedding_lookup/Identity_1:0' shape=(None, 85, 200) dtype=float32>
# Adding a Bidirectional LSTM layer with 50 units (100 outputs after concatenating both directions)
LSTM_model.add(Bidirectional(LSTM(50, return_sequences = True, dropout= 0.4)))
# Adding global pooling to make it 1D
LSTM_model.add(GlobalMaxPooling1D())
# Adding dropout to avoid overfitting
LSTM_model.add(Dropout(0.4))
# Adding output layer
LSTM_model.add(Dense(5, activation = 'softmax'))
LSTM_model.output
<tf.Tensor 'dense/Softmax:0' shape=(None, 5) dtype=float32>
LSTM_model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 85, 200) 437000 _________________________________________________________________ bidirectional (Bidirectional (None, 85, 100) 100400 _________________________________________________________________ global_max_pooling1d (Global (None, 100) 0 _________________________________________________________________ dropout (Dropout) (None, 100) 0 _________________________________________________________________ dense (Dense) (None, 5) 505 ================================================================= Total params: 537,905 Trainable params: 100,905 Non-trainable params: 437,000 _________________________________________________________________
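The Bidirectional LSTM emits a (batch, 85, 100) tensor, and GlobalMaxPooling1D collapses the time axis by keeping the per-feature maximum, so each description is summarized by its strongest activation per feature. A numpy sketch of that reduction on a toy tensor:

```python
import numpy as np

# GlobalMaxPooling1D reduces (batch, timesteps, features) to (batch, features)
# by taking the per-feature maximum over the time axis.
x = np.array([[[1.0, 5.0],
               [3.0, 2.0],
               [0.0, 4.0]]])        # shape (1, 3, 2)
pooled = x.max(axis=1)
print(pooled)                       # [[3. 5.]]
```

Unlike Flatten, this keeps the parameter count independent of sequence length, which is why the LSTM model is much smaller than the dense network above.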
# Compiling the model
LSTM_model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
# Using callbacks to stop training when the validation loss stops improving
early = EarlyStopping(monitor='val_loss', patience=15, verbose=1, min_delta=0.0001, mode='auto')
reduce_learning = ReduceLROnPlateau(patience=15, verbose=1, min_lr=1e-6, factor=0.2)
model_cp = ModelCheckpoint('Industrial_chatbot.h5',monitor='val_loss', save_best_only= True, verbose=1,)
callback_list = [early, reduce_learning, model_cp]
LSTM_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=25, batch_size=32, callbacks=callback_list)
Training log (condensed; each epoch's duplicated progress line is collapsed to one row — val_loss never improved on the best value of 1.26361 from an earlier run, so the checkpoint was not updated; epochs took ~1s at 78–97ms/step):

Epoch   loss    accuracy  val_loss  val_accuracy
 1/25   0.6659  0.7592    1.4066    0.5814
 2/25   0.7139  0.7513    1.4527    0.4651
 3/25   0.6668  0.7644    1.4039    0.4651
 4/25   0.6664  0.7487    1.3787    0.5349
 5/25   0.6053  0.7853    1.3527    0.5116
 6/25   0.5796  0.8115    1.3837    0.5349
 7/25   0.5427  0.8010    1.4777    0.4419
 8/25   0.5319  0.8168    1.4339    0.5116
 9/25   0.4670  0.8743    1.6079    0.3953
10/25   0.4753  0.8586    1.5619    0.4186
11/25   0.4919  0.8089    1.6301    0.4186
12/25   0.4366  0.8691    1.4521    0.5349
13/25   0.4511  0.8534    1.6126    0.3488
14/25   0.4970  0.8194    1.6418    0.4651
15/25   0.4343  0.8639    1.5789    0.4884
16/25   0.3678  0.8927    1.6879    0.3953
17/25   0.3595  0.8822    1.5123    0.5116
18/25   0.3341  0.8979    1.5857    0.4884
19/25   0.2506  0.9346    1.8481    0.5349
20/25   0.3112  0.8979    1.5998    0.5116

Epoch 00020: ReduceLROnPlateau reducing learning rate to 0.0002. Epoch 00020: early stopping
<tensorflow.python.keras.callbacks.History at 0x27961ca1f10>
# Plotting the validation loss history of the model
plt.plot(LSTM_model.history.history['val_loss']);
# Plotting the validation accuracy history of the model
plt.plot(LSTM_model.history.history['val_accuracy']);
LSTM_model.save(working_dir + 'InterimReportIndustrial_chatbot.h5')  # save() returns None, so there is nothing to assign
from tensorflow import keras
LSTM_model = keras.models.load_model(working_dir + 'InterimReportIndustrial_chatbot.h5')
# Evaluating the model
test_result = LSTM_model.evaluate(X_test, y_test)
2/2 [==============================] - 0s 7ms/step - loss: 1.5998 - accuracy: 0.5116
print('Test accuracy of the model:{0:.2%}'.format(test_result[1]))
Test accuracy of the model:51.16%
# Printing the performance metrics
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
def print_confusion_matrix(y_test, ytest_predict):
    cm = confusion_matrix(y_test, ytest_predict)
    cm = pd.DataFrame(cm)
    plt.figure(figsize=(4, 4))
    sns.set()
    sns.heatmap(cm.T, square=True, fmt='', annot=True, cbar=False, cmap='plasma',
                xticklabels=['1', '2', '3', '4', '5'],
                yticklabels=['1', '2', '3', '4', '5']).set_title('Confusion Matrix')
    plt.xlabel('True label')
    plt.ylabel('Predicted label')
    plt.show()
ytest_predict = LSTM_model.predict(X_test)
# The softmax outputs are converted to class labels with argmax (a fixed 0.5 threshold does not apply to multiclass outputs)
print_confusion_matrix(y_test.argmax(axis=1), ytest_predict.argmax(axis=1))
print(classification_report(y_test.argmax(axis=1), ytest_predict.argmax(axis=1), target_names=['1','2','3','4','5']))
              precision    recall  f1-score   support

           1       0.60      0.43      0.50         7
           2       0.43      0.69      0.53        13
           3       0.17      0.17      0.17         6
           4       0.89      0.50      0.64        16
           5       0.50      1.00      0.67         1

    accuracy                           0.51        43
   macro avg       0.52      0.56      0.50        43
weighted avg       0.59      0.51      0.52        43
The bidirectional LSTM gives the best accuracy so far, 51% on the test set. The model needs more data cleaning, and the per-class scores above suggest the class imbalance is hurting the rarer severity levels.
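Given the skewed support counts in the report above (16 level-4 records versus a single level-5), one direction worth trying alongside more data cleaning is to weight the loss by inverse class frequency. A minimal sketch, assuming scikit-learn is available; `y_train_labels` is a toy array mirroring the support counts above, not the project's actual variable:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy label distribution mirroring the imbalance seen in the classification report
y_train_labels = np.array([1]*7 + [2]*13 + [3]*6 + [4]*16 + [5]*1)

classes = np.unique(y_train_labels)
weights = compute_class_weight(class_weight='balanced',
                               classes=classes, y=y_train_labels)

# Keras accepts this as fit(..., class_weight=class_weight)
class_weight = dict(zip(classes, weights))
print(class_weight)  # the rarest class (5) gets the largest weight
```

With 'balanced', each weight is n_samples / (n_classes * count), so the single level-5 record is weighted 43 / (5 × 1) = 8.6.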
import fasttext
#from gensim.models.fasttext import FastText
import csv
ft_df_Potential = pd.DataFrame(columns=['fasttext_data'])
# Note: the cleaned-text column is spelled 'claen_Description' in this dataframe
ft_df_Potential['fasttext_data'] = '__label__' + ds['Potential Accident Level'].astype(str) + ' ' + ds['claen_Description']
train_Potential = ft_df_Potential.head(340)
valid_Potential = ft_df_Potential.tail(85)
train_Potential.to_csv(r'train_Potential.train', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
valid_Potential.to_csv(r'valid_Potential.valid', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
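The `to_csv` calls above are emulating fastText's supervised input format: one example per line, with the label prefixed by `__label__` and separated from the text by a space. A stdlib-only sketch of the same formatting, with made-up example data and a hypothetical demo filename:

```python
# Hypothetical examples: (potential accident level, cleaned description)
examples = [
    ("IV", "employee slipped wet floor near furnace"),
    ("I", "minor cut glove change procedure"),
]

# fastText supervised format: "__label__<label> <text>", one example per line
lines = [f"__label__{label} {text}" for label, text in examples]

with open("train_Potential_demo.train", "w") as f:
    f.write("\n".join(lines) + "\n")

print(lines[0])  # __label__IV employee slipped wet floor near furnace
```

The `quoting=csv.QUOTE_NONE` / empty-quotechar arguments above exist only to stop pandas from wrapping each line in quotes, which would break this format.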
#model = fasttext.train_supervised(input="E:\\Great Learning\\DL\\Capstone\\Data\\Output\\InterimReport\\train_Potential.train", lr=0.7, epoch=300, wordNgrams=2, bucket=200000, dim=50, loss='ova')
model = fasttext.train_supervised(input="train_Potential.train", lr=0.5, epoch=300, wordNgrams=2, bucket=200000, dim=50, loss='ova')
model.predict('forklift went manipulate big bag bioxide section front ladder leads manual displacement splashed spent height forehead fissure pipe subsequently spilling left eye went nearby eyewash cleaning immediately medical center')
(('__label__2',), array([0.93246335]))
model.test("valid_Potential.valid")
(85, 0.43529411764705883, 0.43529411764705883)
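`model.test` returns a tuple of (number of samples, precision@1, recall@1); with exactly one label per example, precision@1 and recall@1 coincide, so the 0.435 above is simply the fraction of the 85 validation records whose top predicted label is correct. A small helper (a hypothetical name, not part of the fastText API) to unpack it:

```python
def summarize_fasttext_test(result):
    """Unpack the (N, precision@1, recall@1) tuple returned by model.test()."""
    n, p, r = result
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"samples": n, "precision@1": p, "recall@1": r, "f1@1": f1}

print(summarize_fasttext_test((85, 0.43529411764705883, 0.43529411764705883)))
```

When precision and recall are equal, the F1 score equals them both, as here.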
model.save_model("ft_model_Potential.bin")
ft_model_Potential = fasttext.train_supervised(input="train_Potential.train", epoch=300)
ft_model_Potential.test("/home/mario/Great learning/RASA_GL/temp/valid_Potential.valid")
# With both lr and epoch tuned together
ft_model_Potential = fasttext.train_supervised(input="train_Potential.train", lr=0.7, epoch=300)
ft_model_Potential.test("valid_Potential.valid")
(85, 0.4470588235294118, 0.4470588235294118)
ft_model_Potential = fasttext.train_supervised(input="train_Potential.train", lr=0.7, epoch=300, wordNgrams=1)
ft_model_Potential.test("valid_Potential.valid")
(85, 0.4588235294117647, 0.4588235294117647)
ft_model_Potential = fasttext.train_supervised(input="train_Potential.train", lr=0.7, epoch=300,bucket=200000, dim=50, loss='hs')
ft_model_Potential.test("valid_Potential.valid")
(85, 0.3764705882352941, 0.3764705882352941)
# Also tried the multi-label one-vs-all loss
ft_model_Potential = fasttext.train_supervised(input="train_Potential.train", lr=0.7, epoch=300, bucket=200000, dim=50, loss='ova')
ft_model_Potential.test("valid_Potential.valid")
(85, 0.43529411764705883, 0.43529411764705883)
One of the main reasons for not achieving very high accuracy could be the lack of large labeled text datasets. Most of the labeled text datasets are not big enough to train deep neural networks because these networks have a huge number of parameters and training such networks on small datasets will cause overfitting.
We will probably use this last approach: freeze all the layers of BERT during fine-tuning and append a dense layer and a softmax layer to the architecture.
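That pattern — a frozen pretrained base with a trainable dense + softmax head — can be sketched in Keras. Here `pretrained_base` is a tiny stand-in model, not an actual BERT encoder (loading one would need the `transformers` library); only the freezing mechanics are the point:

```python
import tensorflow as tf

# Stand-in for a pretrained encoder (in practice: a BERT encoder producing pooled features)
pretrained_base = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
    tf.keras.layers.Dense(32, activation='relu'),
])

# Freeze all layers of the base during fine-tuning
pretrained_base.trainable = False

# Append a trainable dense layer and a softmax layer for the 5 severity classes
model = tf.keras.Sequential([
    pretrained_base,
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

With the base frozen, only the head's weights are updated, which is what makes fine-tuning feasible on a dataset of 425 records.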
We want to build a model that performs robustly, and to this effect we use the same set of hyperparameters across tasks and validation sets. We shall also explore the AWD-LSTM language model (Merity et al., 2017a) with an embedding size of 400, 3 layers, 1150 hidden activations per layer, and a BPTT batch size of 70. We apply dropout of 0.4 to layers, 0.3 to RNN layers, 0.4 to input embedding layers, 0.05 to embedding layers, and weight dropout of 0.5 to the RNN hidden-to-hidden matrix. The classifier has a hidden layer of size 50.
We use Adam with β1 = 0.7 instead of the default β1 = 0.9, and β2 = 0.99. We use a batch size of 64, and base learning rates of 0.004 and 0.01 for fine-tuning the language model and the classifier, respectively.
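Those optimizer settings translate directly into Keras, sketched below; the two learning rates belong to two separate fine-tuning stages (language model first, then classifier), so two optimizer instances are shown:

```python
import tensorflow as tf

# Adam with beta_1 = 0.7 instead of the default 0.9, beta_2 = 0.99
lm_optimizer  = tf.keras.optimizers.Adam(learning_rate=0.004, beta_1=0.7, beta_2=0.99)
clf_optimizer = tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.7, beta_2=0.99)
```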